* [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max
@ 2024-07-29 2:35 Yafang Shao
2024-07-29 2:35 ` [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 2:35 UTC (permalink / raw)
To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao
Background
==========
In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. These
processes are organized as separate processes rather than threads due
to the Python Global Interpreter Lock (GIL) being a bottleneck in a
multi-threaded setup. Upon the exit of these containers, other
containers hosted on the same machine experience significant latency
spikes.
Investigation
=============
During my investigation of this issue, I found that the latency spikes were
caused by zone->lock contention, which can be illustrated as follows:
    CPU A (Freer)                  CPU B (Allocator)
    lock zone->lock
    free pages                     lock zone->lock
    unlock zone->lock
                                   alloc pages
                                   unlock zone->lock
If the Freer holds the zone->lock for an extended period, the Allocator
has to wait, and thus latency spikes occur.
I also wrote a Python script to reproduce it on my test servers. See the
details in patch #3. It is worth noting that the reproducer is based on
the upstream kernel.
Experimenting
=============
The more pages are freed in one batch, the longer the duration will be,
so my attempt involves reducing the batch size. After restricting the
batch to the smallest size, there were no more complaints about latency
spikes.
However, during my experiments, I found that
CONFIG_PCP_BATCH_SCALE_MAX is hard to use in practice, so I try to
improve it in this series.
The Proposal
============
This series encompasses two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous configuration
method.
Future work
===========
To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[0], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[1]. However, implementing
these solutions is likely to necessitate a more extended development
effort.
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]
Changes:
- v1 -> v2: Commit log refinement
- v1: mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
https://lwn.net/Articles/981069/
- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the
minimum pagelist
https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/
Yafang Shao (3):
mm/page_alloc: A minor fix to the calculation of pcp->free_count
mm/page_alloc: Avoid changing pcp->high decaying when adjusting
CONFIG_PCP_BATCH_SCALE_MAX
mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++
mm/Kconfig | 11 -------
mm/page_alloc.c | 40 ++++++++++++++++++-------
3 files changed, 47 insertions(+), 21 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count
2024-07-29 2:35 [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-07-29 2:35 ` Yafang Shao
2024-07-29 2:35 ` [PATCH v2 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
2024-07-29 2:35 ` [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 2:35 UTC (permalink / raw)
To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao
Currently, at worst, pcp->free_count can be
(batch - 1 + (1 << MAX_ORDER)), which may exceed the expected max value of
(batch << CONFIG_PCP_BATCH_SCALE_MAX).
This issue was identified through code review, and no real problems have
been observed.
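For illustration only (not part of the patch), the clamp can be modelled
with a small Python sketch; the batch value of 63, the scale max of 0 and
the order-9 free below are assumed numbers, not taken from the kernel:

    # Model of the pcp->free_count update before and after the fix.
    batch = 63                        # assumed per-zone batch size
    scale_max = 0                     # assumed CONFIG_PCP_BATCH_SCALE_MAX
    cap = batch << scale_max          # expected upper bound (63)
    free_count = cap - 1              # just below the cap
    order = 9                         # assumed high-order free (512 pages)

    old = free_count + (1 << order)             # 574, overshoots the cap
    new = min(free_count + (1 << order), cap)   # 63, clamped as in the patch
    print(old, new)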
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..d2ea2721f6a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2557,7 +2557,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
- pcp->free_count += (1 << order);
+ pcp->free_count = min(pcp->free_count + (1 << order),
+ batch << CONFIG_PCP_BATCH_SCALE_MAX);
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
--
2.43.5
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX
2024-07-29 2:35 [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-29 2:35 ` [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
@ 2024-07-29 2:35 ` Yafang Shao
2024-07-29 2:35 ` [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 2:35 UTC (permalink / raw)
To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao
When adjusting the CONFIG_PCP_BATCH_SCALE_MAX configuration from its
default value of 5 to a lower value, such as 0, it's important to ensure
that the pcp->high decaying is not inadvertently slowed down. Similarly,
when increasing CONFIG_PCP_BATCH_SCALE_MAX to a larger value, like 6, we
must avoid inadvertently increasing the number of pages freed in
free_pcppages_bulk() as a result of this change.
So the following improvements are made:
- hardcode the default value of 5 so that the pcp->high decay is not modified
- split the free_pcppages_bulk() call into multiple steps, as sketched below
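The new draining pattern can be modelled with a minimal Python sketch
(for illustration only; batch = 63 and to_drain = 2016 are assumed
example values, not taken from the patch):

    # Instead of freeing all `to_drain` pages under one zone->lock hold,
    # free at most `batch` pages per hold, mirroring the loop in
    # decay_pcp_high() below.
    batch, to_drain = 63, 2016
    count = holds = 0
    while count < to_drain:
        step = min(batch, to_drain - count)  # pages freed in this lock hold
        # free_pcppages_bulk(zone, step, pcp, 0) runs here in the kernel
        count += batch
        holds += 1
    print(holds)  # 32 short lock holds instead of a single long one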
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/page_alloc.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2ea2721f6a6..bfd44b65777c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2270,7 +2270,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
int high_min, to_drain, batch;
- int todo = 0;
+ int todo = 0, count = 0;
high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
@@ -2280,18 +2280,26 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
* control latency. This caps pcp->high decrement too.
*/
if (pcp->high > high_min) {
- pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+ /*
+ * We will decay 1/8 of pcp->high each time in general, so that
+ * idle PCP pages can be returned to the buddy system in time. To
+ * control the max latency of the decay, we also constrain the
+ * number of pages freed each time.
+ */
+ pcp->high = max3(pcp->count - (batch << 5),
pcp->high - (pcp->high >> 3), high_min);
if (pcp->high > high_min)
todo++;
}
to_drain = pcp->count - pcp->high;
- if (to_drain > 0) {
+ while (count < to_drain) {
spin_lock(&pcp->lock);
- free_pcppages_bulk(zone, to_drain, pcp, 0);
+ free_pcppages_bulk(zone, min(batch, to_drain - count), pcp, 0);
spin_unlock(&pcp->lock);
+ count += batch;
todo++;
+ cond_resched();
}
return todo;
--
2.43.5
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 2:35 [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-29 2:35 ` [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
2024-07-29 2:35 ` [PATCH v2 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
@ 2024-07-29 2:35 ` Yafang Shao
2024-07-29 3:18 ` Huang, Ying
2 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 2:35 UTC (permalink / raw)
To: akpm
Cc: ying.huang, mgorman, linux-mm, Yafang Shao, Matthew Wilcox,
David Rientjes
During my recent work to resolve latency spikes caused by zone->lock
contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
in practice.
To demonstrate this, I wrote a Python script:
import mmap

size = 6 * 1024**3

while True:
    mm = mmap.mmap(-1, size)
    mm[:] = b'\xff' * size
    mm.close()
Run this script 10 times in parallel and measure the allocation latency by
measuring the duration of rmqueue_bulk() with the BCC tool
funclatency[1]:
funclatency -T -i 600 rmqueue_bulk
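For example, the 10 parallel instances can be started with a small
wrapper like the one below (a sketch using Python's multiprocessing;
any other way of launching 10 copies of the script works equally well):

    import mmap
    from multiprocessing import Process

    size = 6 * 1024**3

    def worker():
        # Same loop as the reproducer above: map, touch and unmap 6GB.
        while True:
            mm = mmap.mmap(-1, size)
            mm[:] = b'\xff' * size
            mm.close()

    if __name__ == '__main__':
        procs = [Process(target=worker) for _ in range(10)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()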
Here are the results for both AMD and Intel CPUs.
AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 12 | |
1024 -> 2047 : 9116 | |
2048 -> 4095 : 2004 | |
4096 -> 8191 : 2497 | |
8192 -> 16383 : 2127 | |
16384 -> 32767 : 2483 | |
32768 -> 65535 : 10102 | |
65536 -> 131071 : 212730 |******************* |
131072 -> 262143 : 314692 |***************************** |
262144 -> 524287 : 430058 |****************************************|
524288 -> 1048575 : 224032 |******************** |
1048576 -> 2097151 : 73567 |****** |
2097152 -> 4194303 : 17079 |* |
4194304 -> 8388607 : 3900 | |
8388608 -> 16777215 : 750 | |
16777216 -> 33554431 : 88 | |
33554432 -> 67108863 : 2 | |
avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
The avg alloc latency can be 449us, and the max latency can be higher
than 30ms.
- Value set to 0
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 92 | |
1024 -> 2047 : 8594 | |
2048 -> 4095 : 2042818 |****** |
4096 -> 8191 : 8737624 |************************** |
8192 -> 16383 : 13147872 |****************************************|
16384 -> 32767 : 8799951 |************************** |
32768 -> 65535 : 2879715 |******** |
65536 -> 131071 : 659600 |** |
131072 -> 262143 : 204004 | |
262144 -> 524287 : 78246 | |
524288 -> 1048575 : 30800 | |
1048576 -> 2097151 : 12251 | |
2097152 -> 4194303 : 2950 | |
4194304 -> 8388607 : 78 | |
avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
The avg was reduced significantly to 19us, and the max latency is reduced
to less than 8ms.
- Conclusion
On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.
However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 2419 | |
1024 -> 2047 : 34499 |* |
2048 -> 4095 : 4272 | |
4096 -> 8191 : 9035 | |
8192 -> 16383 : 4374 | |
16384 -> 32767 : 2963 | |
32768 -> 65535 : 6407 | |
65536 -> 131071 : 884806 |****************************************|
131072 -> 262143 : 145931 |****** |
262144 -> 524287 : 13406 | |
524288 -> 1048575 : 1874 | |
1048576 -> 2097151 : 249 | |
2097152 -> 4194303 : 28 | |
avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
- Conclusion
This Intel CPU works fine with the default setting.
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================
Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.
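One way to set that up is sketched below (assuming a cgroup v1 cpuset
hierarchy mounted at /sys/fs/cgroup/cpuset and an assumed CPU list for
node 0; numactl --cpunodebind=0 --membind=0 is an alternative):

    import os

    CG = '/sys/fs/cgroup/cpuset/pcp_test'   # hypothetical group name
    os.makedirs(CG, exist_ok=True)          # requires root

    with open(CG + '/cpuset.mems', 'w') as f:
        f.write('0')                # allocate memory from NUMA node 0 only
    with open(CG + '/cpuset.cpus', 'w') as f:
        f.write('0-23')             # assumed CPU list of node 0
    with open(CG + '/cgroup.procs', 'w') as f:
        f.write(str(os.getpid()))   # move this process into the group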
- Default value of 5
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 46 | |
512 -> 1023 : 695 | |
1024 -> 2047 : 19950 |* |
2048 -> 4095 : 1788 | |
4096 -> 8191 : 3392 | |
8192 -> 16383 : 2569 | |
16384 -> 32767 : 2619 | |
32768 -> 65535 : 3809 | |
65536 -> 131071 : 616182 |****************************************|
131072 -> 262143 : 295587 |******************* |
262144 -> 524287 : 75357 |**** |
524288 -> 1048575 : 15471 |* |
1048576 -> 2097151 : 2939 | |
2097152 -> 4194303 : 243 | |
4194304 -> 8388607 : 3 | |
avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum
latency exceeding 4ms.
- Value set to 0
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 24 | |
512 -> 1023 : 2686 | |
1024 -> 2047 : 10246 | |
2048 -> 4095 : 4061529 |********* |
4096 -> 8191 : 16894971 |****************************************|
8192 -> 16383 : 6279310 |************** |
16384 -> 32767 : 1658240 |*** |
32768 -> 65535 : 445760 |* |
65536 -> 131071 : 110817 | |
131072 -> 262143 : 20279 | |
262144 -> 524287 : 4176 | |
524288 -> 1048575 : 436 | |
1048576 -> 2097151 : 8 | |
2097152 -> 4194303 : 2 | |
avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
After setting it to 0, the avg latency is reduced to around 8us, and the
max latency is less than 4ms.
- Conclusion
On this Intel CPU, this tuning doesn't help much. Latency-sensitive
applications work well with the default setting.
It is worth noting that all the above data were collected using the upstream
kernel.
Why introduce a sysctl knob?
============================
From the above data, it's clear that different CPU types have varying
allocation latencies concerning zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.
Furthermore, for latency-insensitive applications, we can keep the default
setting for better throughput. In our production environment, we set this
value to 0 for applications running on Kubernetes servers while keeping it
at the default value of 5 for other applications like big data. It's not
common to release individual kernel packages for each application.
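For reference, once this patch is applied the knob can be tuned at run
time, e.g. with sysctl -w vm.pcp_batch_scale_max=0 or a small sketch
like the one below (the value 0 is only the setting we use on our
latency-sensitive Kubernetes hosts, not a recommended default):

    # Requires root and a kernel with this patch applied.
    KNOB = '/proc/sys/vm/pcp_batch_scale_max'

    with open(KNOB, 'w') as f:
        f.write('0')             # favor latency over batching throughput

    with open(KNOB) as f:
        print(f.read().strip())  # read back the current value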
Future work
===========
To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[2], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[3]. However, implementing
these solutions is likely to necessitate a more extended development
effort.
Link: https://lwn.net/Articles/981069/ [0]
Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
---
Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
mm/Kconfig | 11 -----------
mm/page_alloc.c | 23 +++++++++++++++++------
3 files changed, 34 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..aa29f2fdad7c 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -65,6 +65,7 @@ Currently, these files are in /proc/sys/vm:
- page-cluster
- page_lock_unfairness
- panic_on_oom
+- pcp_batch_scale_max
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
@@ -845,6 +846,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.
+pcp_batch_scale_max
+===================
+
+In the page allocator, the PCP (Per-CPU Pageset) is refilled and drained
+in batches. The batch number is scaled automatically to improve page
+allocation/free throughput, but a too large scale factor may hurt
+latency. This option sets the upper limit of the scale factor to limit
+the maximum latency.
+
+The range for this parameter spans from 0 to 6, with a default value of 5.
+Setting this parameter to 'N' means that during each refill or drain
+process, at most (batch << N) pages will be involved, where "batch"
+represents the default batch size automatically computed by the kernel
+for each zone.
+
+
percpu_pagelist_high_fraction
=============================
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..41fe4c13b7ac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
-config PCP_BATCH_SCALE_MAX
- int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
- default 5
- range 0 6
- help
- In page allocator, PCP (Per-CPU pageset) is refilled and drained in
- batches. The batch number is scaled automatically to improve page
- allocation/free throughput. But too large scale factor may hurt
- latency. This option sets the upper limit of scale factor to limit
- the maximum latency.
-
config PHYS_ADDR_T_64BIT
def_bool 64BIT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfd44b65777c..8d6f9dc99387 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,8 @@ int min_free_kbytes = 1024;
int user_min_free_kbytes = -1;
static int watermark_boost_factor __read_mostly = 15000;
static int watermark_scale_factor = 10;
+static int pcp_batch_scale_max = 5;
+static int sysctl_6 = 6;
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
@@ -2334,7 +2336,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
int count = READ_ONCE(pcp->count);
while (count) {
- int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
count -= to_drain;
spin_lock(&pcp->lock);
@@ -2462,7 +2464,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
/* Free as much as possible if batch freeing high-order pages. */
if (unlikely(free_high))
- return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ return min(pcp->count, batch << pcp_batch_scale_max);
/* Check for PCP disabled or boot pageset */
if (unlikely(high < batch))
@@ -2494,7 +2496,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return 0;
if (unlikely(free_high)) {
- pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+ pcp->high = max(high - (batch << pcp_batch_scale_max),
high_min);
return 0;
}
@@ -2564,9 +2566,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
- if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+ if (pcp->free_count < (batch << pcp_batch_scale_max))
pcp->free_count = min(pcp->free_count + (1 << order),
- batch << CONFIG_PCP_BATCH_SCALE_MAX);
+ batch << pcp_batch_scale_max);
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2908,7 +2910,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
* subsequent allocation of order-0 pages without any freeing.
*/
if (batch <= max_nr_alloc &&
- pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+ pcp->alloc_factor < pcp_batch_scale_max)
pcp->alloc_factor++;
batch = min(batch, max_nr_alloc);
}
@@ -6275,6 +6277,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
.proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
.extra1 = SYSCTL_ZERO,
},
+ {
+ .procname = "pcp_batch_scale_max",
+ .data = &pcp_batch_scale_max,
+ .maxlen = sizeof(pcp_batch_scale_max),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = &sysctl_6,
+ },
{
.procname = "lowmem_reserve_ratio",
.data = &sysctl_lowmem_reserve_ratio,
--
2.43.5
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 2:35 ` [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-07-29 3:18 ` Huang, Ying
2024-07-29 3:40 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2024-07-29 3:18 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
Hi, Yafang,
Yafang Shao <laoar.shao@gmail.com> writes:
> During my recent work to resolve latency spikes caused by zone->lock
> contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> in practice.
As we discussed before [1], I still feel confused about the description
of zone->lock contention. How about changing the description to
something like,
Larger page allocation/freeing batch number may cause longer run time of
code holding zone->lock. If zone->lock is heavily contended at the same
time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make zone->lock contention
lighter, it can reduce the latency spikes effectively.
[1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> To demonstrate this, I wrote a Python script:
>
> import mmap
>
> size = 6 * 1024**3
>
> while True:
>     mm = mmap.mmap(-1, size)
>     mm[:] = b'\xff' * size
>     mm.close()
>
> Run this script 10 times in parallel and measure the allocation latency by
> measuring the duration of rmqueue_bulk() with the BCC tools
> funclatency[1]:
>
> funclatency -T -i 600 rmqueue_bulk
>
> Here are the results for both AMD and Intel CPUs.
>
> AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> =====================================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 12 | |
> 1024 -> 2047 : 9116 | |
> 2048 -> 4095 : 2004 | |
> 4096 -> 8191 : 2497 | |
> 8192 -> 16383 : 2127 | |
> 16384 -> 32767 : 2483 | |
> 32768 -> 65535 : 10102 | |
> 65536 -> 131071 : 212730 |******************* |
> 131072 -> 262143 : 314692 |***************************** |
> 262144 -> 524287 : 430058 |****************************************|
> 524288 -> 1048575 : 224032 |******************** |
> 1048576 -> 2097151 : 73567 |****** |
> 2097152 -> 4194303 : 17079 |* |
> 4194304 -> 8388607 : 3900 | |
> 8388608 -> 16777215 : 750 | |
> 16777216 -> 33554431 : 88 | |
> 33554432 -> 67108863 : 2 | |
>
> avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>
> The avg alloc latency can be 449us, and the max latency can be higher
> than 30ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 92 | |
> 1024 -> 2047 : 8594 | |
> 2048 -> 4095 : 2042818 |****** |
> 4096 -> 8191 : 8737624 |************************** |
> 8192 -> 16383 : 13147872 |****************************************|
> 16384 -> 32767 : 8799951 |************************** |
> 32768 -> 65535 : 2879715 |******** |
> 65536 -> 131071 : 659600 |** |
> 131072 -> 262143 : 204004 | |
> 262144 -> 524287 : 78246 | |
> 524288 -> 1048575 : 30800 | |
> 1048576 -> 2097151 : 12251 | |
> 2097152 -> 4194303 : 2950 | |
> 4194304 -> 8388607 : 78 | |
>
> avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>
> The avg was reduced significantly to 19us, and the max latency is reduced
> to less than 8ms.
>
> - Conclusion
>
> On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> latency. Latency-sensitive applications will benefit from this tuning.
>
> However, I don't have access to other types of AMD CPUs, so I was unable to
> test it on different AMD models.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> ============================================================
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 0 | |
> 512 -> 1023 : 2419 | |
> 1024 -> 2047 : 34499 |* |
> 2048 -> 4095 : 4272 | |
> 4096 -> 8191 : 9035 | |
> 8192 -> 16383 : 4374 | |
> 16384 -> 32767 : 2963 | |
> 32768 -> 65535 : 6407 | |
> 65536 -> 131071 : 884806 |****************************************|
> 131072 -> 262143 : 145931 |****** |
> 262144 -> 524287 : 13406 | |
> 524288 -> 1048575 : 1874 | |
> 1048576 -> 2097151 : 249 | |
> 2097152 -> 4194303 : 28 | |
>
> avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>
> - Conclusion
>
> This Intel CPU works fine with the default setting.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> ==============================================================
>
> Using the cpuset cgroup, we can restrict the test script to run on NUMA
> node 0 only.
>
> - Default value of 5
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 46 | |
> 512 -> 1023 : 695 | |
> 1024 -> 2047 : 19950 |* |
> 2048 -> 4095 : 1788 | |
> 4096 -> 8191 : 3392 | |
> 8192 -> 16383 : 2569 | |
> 16384 -> 32767 : 2619 | |
> 32768 -> 65535 : 3809 | |
> 65536 -> 131071 : 616182 |****************************************|
> 131072 -> 262143 : 295587 |******************* |
> 262144 -> 524287 : 75357 |**** |
> 524288 -> 1048575 : 15471 |* |
> 1048576 -> 2097151 : 2939 | |
> 2097152 -> 4194303 : 243 | |
> 4194304 -> 8388607 : 3 | |
>
> avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>
> The zone->lock contention becomes severe when there is only a single NUMA
> node. The average latency is approximately 144us, with the maximum
> latency exceeding 4ms.
>
> - Value set to 0
>
> nsecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 0 | |
> 4 -> 7 : 0 | |
> 8 -> 15 : 0 | |
> 16 -> 31 : 0 | |
> 32 -> 63 : 0 | |
> 64 -> 127 : 0 | |
> 128 -> 255 : 0 | |
> 256 -> 511 : 24 | |
> 512 -> 1023 : 2686 | |
> 1024 -> 2047 : 10246 | |
> 2048 -> 4095 : 4061529 |********* |
> 4096 -> 8191 : 16894971 |****************************************|
> 8192 -> 16383 : 6279310 |************** |
> 16384 -> 32767 : 1658240 |*** |
> 32768 -> 65535 : 445760 |* |
> 65536 -> 131071 : 110817 | |
> 131072 -> 262143 : 20279 | |
> 262144 -> 524287 : 4176 | |
> 524288 -> 1048575 : 436 | |
> 1048576 -> 2097151 : 8 | |
> 2097152 -> 4194303 : 2 | |
>
> avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>
> After setting it to 0, the avg latency is reduced to around 8us, and the
> max latency is less than 4ms.
>
> - Conclusion
>
> On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> applications work well with the default setting.
>
> It is worth noting that all the above data were tested using the upstream
> kernel.
>
> Why introduce a systl knob?
> ===========================
>
> From the above data, it's clear that different CPU types have varying
> allocation latencies concerning zone->lock contention. Typically, people
> don't release individual kernel packages for each type of x86_64 CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput. In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.
Thanks for the detailed performance data!
Is there any downside observed from setting CONFIG_PCP_BATCH_SCALE_MAX to 0
in your environment? If not, I suggest using 0 as the default for
CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
that, if someone finds that some other workloads need a larger
CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
[snip]
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 3:18 ` Huang, Ying
@ 2024-07-29 3:40 ` Yafang Shao
2024-07-29 5:12 ` Huang, Ying
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 3:40 UTC (permalink / raw)
To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Yafang,
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > During my recent work to resolve latency spikes caused by zone->lock
> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> > in practice.
>
> As we discussed before [1], I still feel confusing about the description
> about zone->lock contention. How about change the description to
> something like,
Sure, I will change it.
>
> Larger page allocation/freeing batch number may cause longer run time of
> code holding zone->lock. If zone->lock is heavily contended at the same
> time, latency spikes may occur even for casual page allocation/freeing.
> Although reducing the batch number cannot make zone->lock contended
> lighter, it can reduce the latency spikes effectively.
>
> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> > To demonstrate this, I wrote a Python script:
> >
> > import mmap
> >
> > size = 6 * 1024**3
> >
> > while True:
> >     mm = mmap.mmap(-1, size)
> >     mm[:] = b'\xff' * size
> >     mm.close()
> >
> > Run this script 10 times in parallel and measure the allocation latency by
> > measuring the duration of rmqueue_bulk() with the BCC tools
> > funclatency[1]:
> >
> > funclatency -T -i 600 rmqueue_bulk
> >
> > Here are the results for both AMD and Intel CPUs.
> >
> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> > =====================================================================
> >
> > - Default value of 5
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | |
> > 512 -> 1023 : 12 | |
> > 1024 -> 2047 : 9116 | |
> > 2048 -> 4095 : 2004 | |
> > 4096 -> 8191 : 2497 | |
> > 8192 -> 16383 : 2127 | |
> > 16384 -> 32767 : 2483 | |
> > 32768 -> 65535 : 10102 | |
> > 65536 -> 131071 : 212730 |******************* |
> > 131072 -> 262143 : 314692 |***************************** |
> > 262144 -> 524287 : 430058 |****************************************|
> > 524288 -> 1048575 : 224032 |******************** |
> > 1048576 -> 2097151 : 73567 |****** |
> > 2097152 -> 4194303 : 17079 |* |
> > 4194304 -> 8388607 : 3900 | |
> > 8388608 -> 16777215 : 750 | |
> > 16777216 -> 33554431 : 88 | |
> > 33554432 -> 67108863 : 2 | |
> >
> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >
> > The avg alloc latency can be 449us, and the max latency can be higher
> > than 30ms.
> >
> > - Value set to 0
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | |
> > 512 -> 1023 : 92 | |
> > 1024 -> 2047 : 8594 | |
> > 2048 -> 4095 : 2042818 |****** |
> > 4096 -> 8191 : 8737624 |************************** |
> > 8192 -> 16383 : 13147872 |****************************************|
> > 16384 -> 32767 : 8799951 |************************** |
> > 32768 -> 65535 : 2879715 |******** |
> > 65536 -> 131071 : 659600 |** |
> > 131072 -> 262143 : 204004 | |
> > 262144 -> 524287 : 78246 | |
> > 524288 -> 1048575 : 30800 | |
> > 1048576 -> 2097151 : 12251 | |
> > 2097152 -> 4194303 : 2950 | |
> > 4194304 -> 8388607 : 78 | |
> >
> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >
> > The avg was reduced significantly to 19us, and the max latency is reduced
> > to less than 8ms.
> >
> > - Conclusion
> >
> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> > latency. Latency-sensitive applications will benefit from this tuning.
> >
> > However, I don't have access to other types of AMD CPUs, so I was unable to
> > test it on different AMD models.
> >
> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> > ============================================================
> >
> > - Default value of 5
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 0 | |
> > 512 -> 1023 : 2419 | |
> > 1024 -> 2047 : 34499 |* |
> > 2048 -> 4095 : 4272 | |
> > 4096 -> 8191 : 9035 | |
> > 8192 -> 16383 : 4374 | |
> > 16384 -> 32767 : 2963 | |
> > 32768 -> 65535 : 6407 | |
> > 65536 -> 131071 : 884806 |****************************************|
> > 131072 -> 262143 : 145931 |****** |
> > 262144 -> 524287 : 13406 | |
> > 524288 -> 1048575 : 1874 | |
> > 1048576 -> 2097151 : 249 | |
> > 2097152 -> 4194303 : 28 | |
> >
> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >
> > - Conclusion
> >
> > This Intel CPU works fine with the default setting.
> >
> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> > ==============================================================
> >
> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> > node 0 only.
> >
> > - Default value of 5
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 46 | |
> > 512 -> 1023 : 695 | |
> > 1024 -> 2047 : 19950 |* |
> > 2048 -> 4095 : 1788 | |
> > 4096 -> 8191 : 3392 | |
> > 8192 -> 16383 : 2569 | |
> > 16384 -> 32767 : 2619 | |
> > 32768 -> 65535 : 3809 | |
> > 65536 -> 131071 : 616182 |****************************************|
> > 131072 -> 262143 : 295587 |******************* |
> > 262144 -> 524287 : 75357 |**** |
> > 524288 -> 1048575 : 15471 |* |
> > 1048576 -> 2097151 : 2939 | |
> > 2097152 -> 4194303 : 243 | |
> > 4194304 -> 8388607 : 3 | |
> >
> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >
> > The zone->lock contention becomes severe when there is only a single NUMA
> > node. The average latency is approximately 144us, with the maximum
> > latency exceeding 4ms.
> >
> > - Value set to 0
> >
> > nsecs : count distribution
> > 0 -> 1 : 0 | |
> > 2 -> 3 : 0 | |
> > 4 -> 7 : 0 | |
> > 8 -> 15 : 0 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 0 | |
> > 128 -> 255 : 0 | |
> > 256 -> 511 : 24 | |
> > 512 -> 1023 : 2686 | |
> > 1024 -> 2047 : 10246 | |
> > 2048 -> 4095 : 4061529 |********* |
> > 4096 -> 8191 : 16894971 |****************************************|
> > 8192 -> 16383 : 6279310 |************** |
> > 16384 -> 32767 : 1658240 |*** |
> > 32768 -> 65535 : 445760 |* |
> > 65536 -> 131071 : 110817 | |
> > 131072 -> 262143 : 20279 | |
> > 262144 -> 524287 : 4176 | |
> > 524288 -> 1048575 : 436 | |
> > 1048576 -> 2097151 : 8 | |
> > 2097152 -> 4194303 : 2 | |
> >
> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >
> > After setting it to 0, the avg latency is reduced to around 8us, and the
> > max latency is less than 4ms.
> >
> > - Conclusion
> >
> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> > applications work well with the default setting.
> >
> > It is worth noting that all the above data were tested using the upstream
> > kernel.
> >
> > Why introduce a systl knob?
> > ===========================
> >
> > From the above data, it's clear that different CPU types have varying
> > allocation latencies concerning zone->lock contention. Typically, people
> > don't release individual kernel packages for each type of x86_64 CPU.
> >
> > Furthermore, for latency-insensitive applications, we can keep the default
> > setting for better throughput. In our production environment, we set this
> > value to 0 for applications running on Kubernetes servers while keeping it
> > at the default value of 5 for other applications like big data. It's not
> > common to release individual kernel packages for each application.
>
> Thanks for detailed performance data!
>
> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> your environment? If not, I suggest to use 0 as default for
> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> that, if someone found some other workloads need larger
> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>
The decision doesn’t rest with us, the kernel team at our company.
It’s made by the system administrators who manage a large number of
servers. The latency spikes only occur on the Kubernetes (k8s)
servers, not in other environments like big data servers. We have
informed other system administrators, such as those managing the big
data servers, about the latency spike issues, but they are unwilling
to make the change.
No one wants to make changes unless there is evidence showing that the
old settings will negatively impact them. However, as you know,
latency is not a critical concern for big data; throughput is more
important. If we keep the current settings, we will have to release
different kernel packages for different environments, which is a
significant burden for us.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 3:40 ` Yafang Shao
@ 2024-07-29 5:12 ` Huang, Ying
2024-07-29 5:45 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2024-07-29 5:12 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
Yafang Shao <laoar.shao@gmail.com> writes:
> On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Yafang,
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > During my recent work to resolve latency spikes caused by zone->lock
>> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> > in practice.
>>
>> As we discussed before [1], I still feel confusing about the description
>> about zone->lock contention. How about change the description to
>> something like,
>
> Sure, I will change it.
>
>>
>> Larger page allocation/freeing batch number may cause longer run time of
>> code holding zone->lock. If zone->lock is heavily contended at the same
>> time, latency spikes may occur even for casual page allocation/freeing.
>> Although reducing the batch number cannot make zone->lock contended
>> lighter, it can reduce the latency spikes effectively.
>>
>> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>> > To demonstrate this, I wrote a Python script:
>> >
>> > import mmap
>> >
>> > size = 6 * 1024**3
>> >
>> > while True:
>> >     mm = mmap.mmap(-1, size)
>> >     mm[:] = b'\xff' * size
>> >     mm.close()
>> >
>> > Run this script 10 times in parallel and measure the allocation latency by
>> > measuring the duration of rmqueue_bulk() with the BCC tools
>> > funclatency[1]:
>> >
>> > funclatency -T -i 600 rmqueue_bulk
>> >
>> > Here are the results for both AMD and Intel CPUs.
>> >
>> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> > =====================================================================
>> >
>> > - Default value of 5
>> >
>> > nsecs : count distribution
>> > 0 -> 1 : 0 | |
>> > 2 -> 3 : 0 | |
>> > 4 -> 7 : 0 | |
>> > 8 -> 15 : 0 | |
>> > 16 -> 31 : 0 | |
>> > 32 -> 63 : 0 | |
>> > 64 -> 127 : 0 | |
>> > 128 -> 255 : 0 | |
>> > 256 -> 511 : 0 | |
>> > 512 -> 1023 : 12 | |
>> > 1024 -> 2047 : 9116 | |
>> > 2048 -> 4095 : 2004 | |
>> > 4096 -> 8191 : 2497 | |
>> > 8192 -> 16383 : 2127 | |
>> > 16384 -> 32767 : 2483 | |
>> > 32768 -> 65535 : 10102 | |
>> > 65536 -> 131071 : 212730 |******************* |
>> > 131072 -> 262143 : 314692 |***************************** |
>> > 262144 -> 524287 : 430058 |****************************************|
>> > 524288 -> 1048575 : 224032 |******************** |
>> > 1048576 -> 2097151 : 73567 |****** |
>> > 2097152 -> 4194303 : 17079 |* |
>> > 4194304 -> 8388607 : 3900 | |
>> > 8388608 -> 16777215 : 750 | |
>> > 16777216 -> 33554431 : 88 | |
>> > 33554432 -> 67108863 : 2 | |
>> >
>> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >
>> > The avg alloc latency can be 449us, and the max latency can be higher
>> > than 30ms.
>> >
>> > - Value set to 0
>> >
>> > nsecs : count distribution
>> > 0 -> 1 : 0 | |
>> > 2 -> 3 : 0 | |
>> > 4 -> 7 : 0 | |
>> > 8 -> 15 : 0 | |
>> > 16 -> 31 : 0 | |
>> > 32 -> 63 : 0 | |
>> > 64 -> 127 : 0 | |
>> > 128 -> 255 : 0 | |
>> > 256 -> 511 : 0 | |
>> > 512 -> 1023 : 92 | |
>> > 1024 -> 2047 : 8594 | |
>> > 2048 -> 4095 : 2042818 |****** |
>> > 4096 -> 8191 : 8737624 |************************** |
>> > 8192 -> 16383 : 13147872 |****************************************|
>> > 16384 -> 32767 : 8799951 |************************** |
>> > 32768 -> 65535 : 2879715 |******** |
>> > 65536 -> 131071 : 659600 |** |
>> > 131072 -> 262143 : 204004 | |
>> > 262144 -> 524287 : 78246 | |
>> > 524288 -> 1048575 : 30800 | |
>> > 1048576 -> 2097151 : 12251 | |
>> > 2097152 -> 4194303 : 2950 | |
>> > 4194304 -> 8388607 : 78 | |
>> >
>> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >
>> > The avg was reduced significantly to 19us, and the max latency is reduced
>> > to less than 8ms.
>> >
>> > - Conclusion
>> >
>> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> > latency. Latency-sensitive applications will benefit from this tuning.
>> >
>> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> > test it on different AMD models.
>> >
>> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> > ============================================================
>> >
>> > - Default value of 5
>> >
>> > nsecs : count distribution
>> > 0 -> 1 : 0 | |
>> > 2 -> 3 : 0 | |
>> > 4 -> 7 : 0 | |
>> > 8 -> 15 : 0 | |
>> > 16 -> 31 : 0 | |
>> > 32 -> 63 : 0 | |
>> > 64 -> 127 : 0 | |
>> > 128 -> 255 : 0 | |
>> > 256 -> 511 : 0 | |
>> > 512 -> 1023 : 2419 | |
>> > 1024 -> 2047 : 34499 |* |
>> > 2048 -> 4095 : 4272 | |
>> > 4096 -> 8191 : 9035 | |
>> > 8192 -> 16383 : 4374 | |
>> > 16384 -> 32767 : 2963 | |
>> > 32768 -> 65535 : 6407 | |
>> > 65536 -> 131071 : 884806 |****************************************|
>> > 131072 -> 262143 : 145931 |****** |
>> > 262144 -> 524287 : 13406 | |
>> > 524288 -> 1048575 : 1874 | |
>> > 1048576 -> 2097151 : 249 | |
>> > 2097152 -> 4194303 : 28 | |
>> >
>> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >
>> > - Conclusion
>> >
>> > This Intel CPU works fine with the default setting.
>> >
>> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> > ==============================================================
>> >
>> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> > node 0 only.
>> >
>> > - Default value of 5
>> >
>> > nsecs : count distribution
>> > 0 -> 1 : 0 | |
>> > 2 -> 3 : 0 | |
>> > 4 -> 7 : 0 | |
>> > 8 -> 15 : 0 | |
>> > 16 -> 31 : 0 | |
>> > 32 -> 63 : 0 | |
>> > 64 -> 127 : 0 | |
>> > 128 -> 255 : 0 | |
>> > 256 -> 511 : 46 | |
>> > 512 -> 1023 : 695 | |
>> > 1024 -> 2047 : 19950 |* |
>> > 2048 -> 4095 : 1788 | |
>> > 4096 -> 8191 : 3392 | |
>> > 8192 -> 16383 : 2569 | |
>> > 16384 -> 32767 : 2619 | |
>> > 32768 -> 65535 : 3809 | |
>> > 65536 -> 131071 : 616182 |****************************************|
>> > 131072 -> 262143 : 295587 |******************* |
>> > 262144 -> 524287 : 75357 |**** |
>> > 524288 -> 1048575 : 15471 |* |
>> > 1048576 -> 2097151 : 2939 | |
>> > 2097152 -> 4194303 : 243 | |
>> > 4194304 -> 8388607 : 3 | |
>> >
>> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >
>> > The zone->lock contention becomes severe when there is only a single NUMA
>> > node. The average latency is approximately 144us, with the maximum
>> > latency exceeding 4ms.
>> >
>> > - Value set to 0
>> >
>> > nsecs : count distribution
>> > 0 -> 1 : 0 | |
>> > 2 -> 3 : 0 | |
>> > 4 -> 7 : 0 | |
>> > 8 -> 15 : 0 | |
>> > 16 -> 31 : 0 | |
>> > 32 -> 63 : 0 | |
>> > 64 -> 127 : 0 | |
>> > 128 -> 255 : 0 | |
>> > 256 -> 511 : 24 | |
>> > 512 -> 1023 : 2686 | |
>> > 1024 -> 2047 : 10246 | |
>> > 2048 -> 4095 : 4061529 |********* |
>> > 4096 -> 8191 : 16894971 |****************************************|
>> > 8192 -> 16383 : 6279310 |************** |
>> > 16384 -> 32767 : 1658240 |*** |
>> > 32768 -> 65535 : 445760 |* |
>> > 65536 -> 131071 : 110817 | |
>> > 131072 -> 262143 : 20279 | |
>> > 262144 -> 524287 : 4176 | |
>> > 524288 -> 1048575 : 436 | |
>> > 1048576 -> 2097151 : 8 | |
>> > 2097152 -> 4194303 : 2 | |
>> >
>> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >
>> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> > max latency is less than 4ms.
>> >
>> > - Conclusion
>> >
>> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> > applications work well with the default setting.
>> >
>> > It is worth noting that all the above data were tested using the upstream
>> > kernel.
>> >
>> > Why introduce a systl knob?
>> > ===========================
>> >
>> > From the above data, it's clear that different CPU types have varying
>> > allocation latencies concerning zone->lock contention. Typically, people
>> > don't release individual kernel packages for each type of x86_64 CPU.
>> >
>> > Furthermore, for latency-insensitive applications, we can keep the default
>> > setting for better throughput. In our production environment, we set this
>> > value to 0 for applications running on Kubernetes servers while keeping it
>> > at the default value of 5 for other applications like big data. It's not
>> > common to release individual kernel packages for each application.
>>
>> Thanks for detailed performance data!
>>
>> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
>> your environment? If not, I suggest to use 0 as default for
>> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
>> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> that, if someone found some other workloads need larger
>> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>>
>
> The decision doesn’t rest with us, the kernel team at our company.
> It’s made by the system administrators who manage a large number of
> servers. The latency spikes only occur on the Kubernetes (k8s)
> servers, not in other environments like big data servers. We have
> informed other system administrators, such as those managing the big
> data servers, about the latency spike issues, but they are unwilling
> to make the change.
>
> No one wants to make changes unless there is evidence showing that the
> old settings will negatively impact them. However, as you know,
> latency is not a critical concern for big data; throughput is more
> important. If we keep the current settings, we will have to release
> different kernel packages for different environments, which is a
> significant burden for us.
I totally understand your requirements. And I think that this is better
resolved in your downstream kernel. If there is clear evidence that a
small batch number hurts throughput for some workloads, we can
make the change in the upstream kernel.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 5:12 ` Huang, Ying
@ 2024-07-29 5:45 ` Yafang Shao
2024-07-29 5:50 ` Huang, Ying
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 5:45 UTC (permalink / raw)
To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Yafang,
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> > in practice.
> >>
> >> As we discussed before [1], I still feel confusing about the description
> >> about zone->lock contention. How about change the description to
> >> something like,
> >
> > Sure, I will change it.
> >
> >>
> >> Larger page allocation/freeing batch number may cause longer run time of
> >> code holding zone->lock. If zone->lock is heavily contended at the same
> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> Although reducing the batch number cannot make zone->lock contended
> >> lighter, it can reduce the latency spikes effectively.
> >>
> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >>
> >> > To demonstrate this, I wrote a Python script:
> >> >
> >> > import mmap
> >> >
> >> > size = 6 * 1024**3
> >> >
> >> > while True:
> >> >     mm = mmap.mmap(-1, size)
> >> >     mm[:] = b'\xff' * size
> >> >     mm.close()
> >> >
> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> > funclatency[1]:
> >> >
> >> > funclatency -T -i 600 rmqueue_bulk
> >> >
> >> > Here are the results for both AMD and Intel CPUs.
> >> >
> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> > =====================================================================
> >> >
> >> > - Default value of 5
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 0 | |
> >> > 512 -> 1023 : 12 | |
> >> > 1024 -> 2047 : 9116 | |
> >> > 2048 -> 4095 : 2004 | |
> >> > 4096 -> 8191 : 2497 | |
> >> > 8192 -> 16383 : 2127 | |
> >> > 16384 -> 32767 : 2483 | |
> >> > 32768 -> 65535 : 10102 | |
> >> > 65536 -> 131071 : 212730 |******************* |
> >> > 131072 -> 262143 : 314692 |***************************** |
> >> > 262144 -> 524287 : 430058 |****************************************|
> >> > 524288 -> 1048575 : 224032 |******************** |
> >> > 1048576 -> 2097151 : 73567 |****** |
> >> > 2097152 -> 4194303 : 17079 |* |
> >> > 4194304 -> 8388607 : 3900 | |
> >> > 8388608 -> 16777215 : 750 | |
> >> > 16777216 -> 33554431 : 88 | |
> >> > 33554432 -> 67108863 : 2 | |
> >> >
> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >
> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> > than 30ms.
> >> >
> >> > - Value set to 0
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 0 | |
> >> > 512 -> 1023 : 92 | |
> >> > 1024 -> 2047 : 8594 | |
> >> > 2048 -> 4095 : 2042818 |****** |
> >> > 4096 -> 8191 : 8737624 |************************** |
> >> > 8192 -> 16383 : 13147872 |****************************************|
> >> > 16384 -> 32767 : 8799951 |************************** |
> >> > 32768 -> 65535 : 2879715 |******** |
> >> > 65536 -> 131071 : 659600 |** |
> >> > 131072 -> 262143 : 204004 | |
> >> > 262144 -> 524287 : 78246 | |
> >> > 524288 -> 1048575 : 30800 | |
> >> > 1048576 -> 2097151 : 12251 | |
> >> > 2097152 -> 4194303 : 2950 | |
> >> > 4194304 -> 8388607 : 78 | |
> >> >
> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >
> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> > to less than 8ms.
> >> >
> >> > - Conclusion
> >> >
> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >
> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> > test it on different AMD models.
> >> >
> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> > ============================================================
> >> >
> >> > - Default value of 5
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 0 | |
> >> > 512 -> 1023 : 2419 | |
> >> > 1024 -> 2047 : 34499 |* |
> >> > 2048 -> 4095 : 4272 | |
> >> > 4096 -> 8191 : 9035 | |
> >> > 8192 -> 16383 : 4374 | |
> >> > 16384 -> 32767 : 2963 | |
> >> > 32768 -> 65535 : 6407 | |
> >> > 65536 -> 131071 : 884806 |****************************************|
> >> > 131072 -> 262143 : 145931 |****** |
> >> > 262144 -> 524287 : 13406 | |
> >> > 524288 -> 1048575 : 1874 | |
> >> > 1048576 -> 2097151 : 249 | |
> >> > 2097152 -> 4194303 : 28 | |
> >> >
> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >
> >> > - Conclusion
> >> >
> >> > This Intel CPU works fine with the default setting.
> >> >
> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> > ==============================================================
> >> >
> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> > node 0 only.
> >> >
> >> > - Default value of 5
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 46 | |
> >> > 512 -> 1023 : 695 | |
> >> > 1024 -> 2047 : 19950 |* |
> >> > 2048 -> 4095 : 1788 | |
> >> > 4096 -> 8191 : 3392 | |
> >> > 8192 -> 16383 : 2569 | |
> >> > 16384 -> 32767 : 2619 | |
> >> > 32768 -> 65535 : 3809 | |
> >> > 65536 -> 131071 : 616182 |****************************************|
> >> > 131072 -> 262143 : 295587 |******************* |
> >> > 262144 -> 524287 : 75357 |**** |
> >> > 524288 -> 1048575 : 15471 |* |
> >> > 1048576 -> 2097151 : 2939 | |
> >> > 2097152 -> 4194303 : 243 | |
> >> > 4194304 -> 8388607 : 3 | |
> >> >
> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >
> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> > node. The average latency is approximately 144us, with the maximum
> >> > latency exceeding 4ms.
> >> >
> >> > - Value set to 0
> >> >
> >> > nsecs : count distribution
> >> > 0 -> 1 : 0 | |
> >> > 2 -> 3 : 0 | |
> >> > 4 -> 7 : 0 | |
> >> > 8 -> 15 : 0 | |
> >> > 16 -> 31 : 0 | |
> >> > 32 -> 63 : 0 | |
> >> > 64 -> 127 : 0 | |
> >> > 128 -> 255 : 0 | |
> >> > 256 -> 511 : 24 | |
> >> > 512 -> 1023 : 2686 | |
> >> > 1024 -> 2047 : 10246 | |
> >> > 2048 -> 4095 : 4061529 |********* |
> >> > 4096 -> 8191 : 16894971 |****************************************|
> >> > 8192 -> 16383 : 6279310 |************** |
> >> > 16384 -> 32767 : 1658240 |*** |
> >> > 32768 -> 65535 : 445760 |* |
> >> > 65536 -> 131071 : 110817 | |
> >> > 131072 -> 262143 : 20279 | |
> >> > 262144 -> 524287 : 4176 | |
> >> > 524288 -> 1048575 : 436 | |
> >> > 1048576 -> 2097151 : 8 | |
> >> > 2097152 -> 4194303 : 2 | |
> >> >
> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >
> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> > max latency is less than 4ms.
> >> >
> >> > - Conclusion
> >> >
> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> > applications work well with the default setting.
> >> >
> >> > It is worth noting that all the above data were tested using the upstream
> >> > kernel.
> >> >
> >> > Why introduce a sysctl knob?
> >> > ============================
> >> >
> >> > From the above data, it's clear that different CPU types have varying
> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >
> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> > setting for better throughput. In our production environment, we set this
> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> > at the default value of 5 for other applications like big data. It's not
> >> > common to release individual kernel packages for each application.
> >>
> >> Thanks for detailed performance data!
> >>
> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> >> your environment? If not, I suggest to use 0 as default for
> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> >> that, if someone found some other workloads need larger
> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >>
> >
> > The decision doesn’t rest with us, the kernel team at our company.
> > It’s made by the system administrators who manage a large number of
> > servers. The latency spikes only occur on the Kubernetes (k8s)
> > servers, not in other environments like big data servers. We have
> > informed other system administrators, such as those managing the big
> > data servers, about the latency spike issues, but they are unwilling
> > to make the change.
> >
> > No one wants to make changes unless there is evidence showing that the
> > old settings will negatively impact them. However, as you know,
> > latency is not a critical concern for big data; throughput is more
> > important. If we keep the current settings, we will have to release
> > different kernel packages for different environments, which is a
> > significant burden for us.
>
> Totally understand your requirements. And, I think that this is better
> to be resolved in your downstream kernel. If there are clear evidences
> to prove small batch number hurts throughput for some workloads, we can
> make the change in the upstream kernel.
>
Please don't make this more complicated. We are at an impasse.
The key issue here is that the upstream kernel has a default value of
5, not 0. If you can change it to 0, we can persuade our users to
follow the upstream changes. They currently set it to 5, not because
you, the author, chose this value, but because it is the default in
Linus's tree. Since it's in Linus's tree, kernel developers worldwide
support it. It's not just your decision as the author, but the entire
community supports this default.
If, in the future, we find that the value of 0 is not suitable, you'll
tell us, "It is an issue in your downstream kernel, not in the
upstream kernel, so we won't accept it." PANIC.
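To be concrete about why the runtime knob matters to us: with patch 3/3
applied, a single kernel package could serve both fleets and the tuning
becomes a sysctl profile. A minimal sketch, assuming the knob from this
series lands (the sysctl.d file name below is only an example):

    # latency-sensitive hosts (our k8s fleet)
    sysctl -w vm.pcp_batch_scale_max=0
    echo 'vm.pcp_batch_scale_max = 0' > /etc/sysctl.d/99-pcp-batch.conf

    # throughput-oriented hosts (e.g. big data) keep the current default
    sysctl -w vm.pcp_batch_scale_max=5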
--
Regards
Yafang
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 5:45 ` Yafang Shao
@ 2024-07-29 5:50 ` Huang, Ying
2024-07-29 6:00 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2024-07-29 5:50 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
Yafang Shao <laoar.shao@gmail.com> writes:
> On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Yafang,
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> > in practice.
>> >>
>> >> As we discussed before [1], I still feel confused about the description
>> >> of zone->lock contention. How about changing the description to
>> >> something like,
>> >
>> > Sure, I will change it.
>> >
>> >>
>> >> Larger page allocation/freeing batch number may cause longer run time of
>> >> code holding zone->lock. If zone->lock is heavily contended at the same
>> >> time, latency spikes may occur even for casual page allocation/freeing.
>> >> Although reducing the batch number cannot make zone->lock contended
>> >> lighter, it can reduce the latency spikes effectively.
>> >>
>> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >>
>> >> > To demonstrate this, I wrote a Python script:
>> >> >
>> >> > import mmap
>> >> >
>> >> > size = 6 * 1024**3
>> >> >
>> >> > while True:
>> >> >     mm = mmap.mmap(-1, size)
>> >> >     mm[:] = b'\xff' * size
>> >> >     mm.close()
>> >> >
>> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> > measuring the duration of rmqueue_bulk() with the BCC tools
>> >> > funclatency[1]:
>> >> >
>> >> > funclatency -T -i 600 rmqueue_bulk
>> >> >
>> >> > Here are the results for both AMD and Intel CPUs.
>> >> >
>> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> > =====================================================================
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> > nsecs : count distribution
>> >> > 0 -> 1 : 0 | |
>> >> > 2 -> 3 : 0 | |
>> >> > 4 -> 7 : 0 | |
>> >> > 8 -> 15 : 0 | |
>> >> > 16 -> 31 : 0 | |
>> >> > 32 -> 63 : 0 | |
>> >> > 64 -> 127 : 0 | |
>> >> > 128 -> 255 : 0 | |
>> >> > 256 -> 511 : 0 | |
>> >> > 512 -> 1023 : 12 | |
>> >> > 1024 -> 2047 : 9116 | |
>> >> > 2048 -> 4095 : 2004 | |
>> >> > 4096 -> 8191 : 2497 | |
>> >> > 8192 -> 16383 : 2127 | |
>> >> > 16384 -> 32767 : 2483 | |
>> >> > 32768 -> 65535 : 10102 | |
>> >> > 65536 -> 131071 : 212730 |******************* |
>> >> > 131072 -> 262143 : 314692 |***************************** |
>> >> > 262144 -> 524287 : 430058 |****************************************|
>> >> > 524288 -> 1048575 : 224032 |******************** |
>> >> > 1048576 -> 2097151 : 73567 |****** |
>> >> > 2097152 -> 4194303 : 17079 |* |
>> >> > 4194304 -> 8388607 : 3900 | |
>> >> > 8388608 -> 16777215 : 750 | |
>> >> > 16777216 -> 33554431 : 88 | |
>> >> > 33554432 -> 67108863 : 2 | |
>> >> >
>> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >
>> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> > than 30ms.
>> >> >
>> >> > - Value set to 0
>> >> >
>> >> > nsecs : count distribution
>> >> > 0 -> 1 : 0 | |
>> >> > 2 -> 3 : 0 | |
>> >> > 4 -> 7 : 0 | |
>> >> > 8 -> 15 : 0 | |
>> >> > 16 -> 31 : 0 | |
>> >> > 32 -> 63 : 0 | |
>> >> > 64 -> 127 : 0 | |
>> >> > 128 -> 255 : 0 | |
>> >> > 256 -> 511 : 0 | |
>> >> > 512 -> 1023 : 92 | |
>> >> > 1024 -> 2047 : 8594 | |
>> >> > 2048 -> 4095 : 2042818 |****** |
>> >> > 4096 -> 8191 : 8737624 |************************** |
>> >> > 8192 -> 16383 : 13147872 |****************************************|
>> >> > 16384 -> 32767 : 8799951 |************************** |
>> >> > 32768 -> 65535 : 2879715 |******** |
>> >> > 65536 -> 131071 : 659600 |** |
>> >> > 131072 -> 262143 : 204004 | |
>> >> > 262144 -> 524287 : 78246 | |
>> >> > 524288 -> 1048575 : 30800 | |
>> >> > 1048576 -> 2097151 : 12251 | |
>> >> > 2097152 -> 4194303 : 2950 | |
>> >> > 4194304 -> 8388607 : 78 | |
>> >> >
>> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >
>> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> > to less than 8ms.
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >
>> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> > test it on different AMD models.
>> >> >
>> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> > ============================================================
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> > nsecs : count distribution
>> >> > 0 -> 1 : 0 | |
>> >> > 2 -> 3 : 0 | |
>> >> > 4 -> 7 : 0 | |
>> >> > 8 -> 15 : 0 | |
>> >> > 16 -> 31 : 0 | |
>> >> > 32 -> 63 : 0 | |
>> >> > 64 -> 127 : 0 | |
>> >> > 128 -> 255 : 0 | |
>> >> > 256 -> 511 : 0 | |
>> >> > 512 -> 1023 : 2419 | |
>> >> > 1024 -> 2047 : 34499 |* |
>> >> > 2048 -> 4095 : 4272 | |
>> >> > 4096 -> 8191 : 9035 | |
>> >> > 8192 -> 16383 : 4374 | |
>> >> > 16384 -> 32767 : 2963 | |
>> >> > 32768 -> 65535 : 6407 | |
>> >> > 65536 -> 131071 : 884806 |****************************************|
>> >> > 131072 -> 262143 : 145931 |****** |
>> >> > 262144 -> 524287 : 13406 | |
>> >> > 524288 -> 1048575 : 1874 | |
>> >> > 1048576 -> 2097151 : 249 | |
>> >> > 2097152 -> 4194303 : 28 | |
>> >> >
>> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > This Intel CPU works fine with the default setting.
>> >> >
>> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> > ==============================================================
>> >> >
>> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> > node 0 only.
>> >> >
>> >> > - Default value of 5
>> >> >
>> >> > nsecs : count distribution
>> >> > 0 -> 1 : 0 | |
>> >> > 2 -> 3 : 0 | |
>> >> > 4 -> 7 : 0 | |
>> >> > 8 -> 15 : 0 | |
>> >> > 16 -> 31 : 0 | |
>> >> > 32 -> 63 : 0 | |
>> >> > 64 -> 127 : 0 | |
>> >> > 128 -> 255 : 0 | |
>> >> > 256 -> 511 : 46 | |
>> >> > 512 -> 1023 : 695 | |
>> >> > 1024 -> 2047 : 19950 |* |
>> >> > 2048 -> 4095 : 1788 | |
>> >> > 4096 -> 8191 : 3392 | |
>> >> > 8192 -> 16383 : 2569 | |
>> >> > 16384 -> 32767 : 2619 | |
>> >> > 32768 -> 65535 : 3809 | |
>> >> > 65536 -> 131071 : 616182 |****************************************|
>> >> > 131072 -> 262143 : 295587 |******************* |
>> >> > 262144 -> 524287 : 75357 |**** |
>> >> > 524288 -> 1048575 : 15471 |* |
>> >> > 1048576 -> 2097151 : 2939 | |
>> >> > 2097152 -> 4194303 : 243 | |
>> >> > 4194304 -> 8388607 : 3 | |
>> >> >
>> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >
>> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> > node. The average latency is approximately 144us, with the maximum
>> >> > latency exceeding 4ms.
>> >> >
>> >> > - Value set to 0
>> >> >
>> >> > nsecs : count distribution
>> >> > 0 -> 1 : 0 | |
>> >> > 2 -> 3 : 0 | |
>> >> > 4 -> 7 : 0 | |
>> >> > 8 -> 15 : 0 | |
>> >> > 16 -> 31 : 0 | |
>> >> > 32 -> 63 : 0 | |
>> >> > 64 -> 127 : 0 | |
>> >> > 128 -> 255 : 0 | |
>> >> > 256 -> 511 : 24 | |
>> >> > 512 -> 1023 : 2686 | |
>> >> > 1024 -> 2047 : 10246 | |
>> >> > 2048 -> 4095 : 4061529 |********* |
>> >> > 4096 -> 8191 : 16894971 |****************************************|
>> >> > 8192 -> 16383 : 6279310 |************** |
>> >> > 16384 -> 32767 : 1658240 |*** |
>> >> > 32768 -> 65535 : 445760 |* |
>> >> > 65536 -> 131071 : 110817 | |
>> >> > 131072 -> 262143 : 20279 | |
>> >> > 262144 -> 524287 : 4176 | |
>> >> > 524288 -> 1048575 : 436 | |
>> >> > 1048576 -> 2097151 : 8 | |
>> >> > 2097152 -> 4194303 : 2 | |
>> >> >
>> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >
>> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> > max latency is less than 4ms.
>> >> >
>> >> > - Conclusion
>> >> >
>> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> > applications work well with the default setting.
>> >> >
>> >> > It is worth noting that all the above data were tested using the upstream
>> >> > kernel.
>> >> >
>> >> > Why introduce a sysctl knob?
>> >> > ============================
>> >> >
>> >> > From the above data, it's clear that different CPU types have varying
>> >> > allocation latencies concerning zone->lock contention. Typically, people
>> >> > don't release individual kernel packages for each type of x86_64 CPU.
>> >> >
>> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> > setting for better throughput. In our production environment, we set this
>> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> > at the default value of 5 for other applications like big data. It's not
>> >> > common to release individual kernel packages for each application.
>> >>
>> >> Thanks for detailed performance data!
>> >>
>> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
>> >> your environment? If not, I suggest to use 0 as default for
>> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
>> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> >> that, if someone found some other workloads need larger
>> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>> >>
>> >
>> > The decision doesn’t rest with us, the kernel team at our company.
>> > It’s made by the system administrators who manage a large number of
>> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> > servers, not in other environments like big data servers. We have
>> > informed other system administrators, such as those managing the big
>> > data servers, about the latency spike issues, but they are unwilling
>> > to make the change.
>> >
>> > No one wants to make changes unless there is evidence showing that the
>> > old settings will negatively impact them. However, as you know,
>> > latency is not a critical concern for big data; throughput is more
>> > important. If we keep the current settings, we will have to release
>> > different kernel packages for different environments, which is a
>> > significant burden for us.
>>
>> Totally understand your requirements. And, I think that this is better
>> to be resolved in your downstream kernel. If there are clear evidences
>> to prove small batch number hurts throughput for some workloads, we can
>> make the change in the upstream kernel.
>>
>
> Please don't make this more complicated. We are at an impasse.
>
> The key issue here is that the upstream kernel has a default value of
> 5, not 0. If you can change it to 0, we can persuade our users to
> follow the upstream changes. They currently set it to 5, not because
> you, the author, chose this value, but because it is the default in
> Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> support it. It's not just your decision as the author, but the entire
> community supports this default.
>
> If, in the future, we find that the value of 0 is not suitable, you'll
> tell us, "It is an issue in your downstream kernel, not in the
> upstream kernel, so we won't accept it." PANIC.
I don't think so. I suggest you change the default value to 0. If
someone reports that their workloads need some other value, then we
have evidence that different workloads need different values. At that
time, we can suggest adding a user-tunable knob.
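For reference, such a downstream default change is only a one-line config
tweak rather than a code change; a rough sketch of one possible build flow
(it simply flips the existing Kconfig option, nothing here is new to this
series):

    # in the downstream kernel tree
    scripts/config --set-val PCP_BATCH_SCALE_MAX 0
    make olddefconfig
    grep CONFIG_PCP_BATCH_SCALE_MAX .config   # expect CONFIG_PCP_BATCH_SCALE_MAX=0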
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 5:50 ` Huang, Ying
@ 2024-07-29 6:00 ` Yafang Shao
2024-07-29 6:00 ` Huang, Ying
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 6:00 UTC (permalink / raw)
To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Hi, Yafang,
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> > in practice.
> >> >>
> >> >> As we discussed before [1], I still feel confused about the description
> >> >> of zone->lock contention. How about changing the description to
> >> >> something like,
> >> >
> >> > Sure, I will change it.
> >> >
> >> >>
> >> >> Larger page allocation/freeing batch number may cause longer run time of
> >> >> code holding zone->lock. If zone->lock is heavily contended at the same
> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> Although reducing the batch number cannot make zone->lock contended
> >> >> lighter, it can reduce the latency spikes effectively.
> >> >>
> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >>
> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >
> >> >> > import mmap
> >> >> >
> >> >> > size = 6 * 1024**3
> >> >> >
> >> >> > while True:
> >> >> >     mm = mmap.mmap(-1, size)
> >> >> >     mm[:] = b'\xff' * size
> >> >> >     mm.close()
> >> >> >
> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> > funclatency[1]:
> >> >> >
> >> >> > funclatency -T -i 600 rmqueue_bulk
> >> >> >
> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >
> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> > =====================================================================
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> > nsecs : count distribution
> >> >> > 0 -> 1 : 0 | |
> >> >> > 2 -> 3 : 0 | |
> >> >> > 4 -> 7 : 0 | |
> >> >> > 8 -> 15 : 0 | |
> >> >> > 16 -> 31 : 0 | |
> >> >> > 32 -> 63 : 0 | |
> >> >> > 64 -> 127 : 0 | |
> >> >> > 128 -> 255 : 0 | |
> >> >> > 256 -> 511 : 0 | |
> >> >> > 512 -> 1023 : 12 | |
> >> >> > 1024 -> 2047 : 9116 | |
> >> >> > 2048 -> 4095 : 2004 | |
> >> >> > 4096 -> 8191 : 2497 | |
> >> >> > 8192 -> 16383 : 2127 | |
> >> >> > 16384 -> 32767 : 2483 | |
> >> >> > 32768 -> 65535 : 10102 | |
> >> >> > 65536 -> 131071 : 212730 |******************* |
> >> >> > 131072 -> 262143 : 314692 |***************************** |
> >> >> > 262144 -> 524287 : 430058 |****************************************|
> >> >> > 524288 -> 1048575 : 224032 |******************** |
> >> >> > 1048576 -> 2097151 : 73567 |****** |
> >> >> > 2097152 -> 4194303 : 17079 |* |
> >> >> > 4194304 -> 8388607 : 3900 | |
> >> >> > 8388608 -> 16777215 : 750 | |
> >> >> > 16777216 -> 33554431 : 88 | |
> >> >> > 33554432 -> 67108863 : 2 | |
> >> >> >
> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >
> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> > than 30ms.
> >> >> >
> >> >> > - Value set to 0
> >> >> >
> >> >> > nsecs : count distribution
> >> >> > 0 -> 1 : 0 | |
> >> >> > 2 -> 3 : 0 | |
> >> >> > 4 -> 7 : 0 | |
> >> >> > 8 -> 15 : 0 | |
> >> >> > 16 -> 31 : 0 | |
> >> >> > 32 -> 63 : 0 | |
> >> >> > 64 -> 127 : 0 | |
> >> >> > 128 -> 255 : 0 | |
> >> >> > 256 -> 511 : 0 | |
> >> >> > 512 -> 1023 : 92 | |
> >> >> > 1024 -> 2047 : 8594 | |
> >> >> > 2048 -> 4095 : 2042818 |****** |
> >> >> > 4096 -> 8191 : 8737624 |************************** |
> >> >> > 8192 -> 16383 : 13147872 |****************************************|
> >> >> > 16384 -> 32767 : 8799951 |************************** |
> >> >> > 32768 -> 65535 : 2879715 |******** |
> >> >> > 65536 -> 131071 : 659600 |** |
> >> >> > 131072 -> 262143 : 204004 | |
> >> >> > 262144 -> 524287 : 78246 | |
> >> >> > 524288 -> 1048575 : 30800 | |
> >> >> > 1048576 -> 2097151 : 12251 | |
> >> >> > 2097152 -> 4194303 : 2950 | |
> >> >> > 4194304 -> 8388607 : 78 | |
> >> >> >
> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >
> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> > to less than 8ms.
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >
> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> > test it on different AMD models.
> >> >> >
> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> > ============================================================
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> > nsecs : count distribution
> >> >> > 0 -> 1 : 0 | |
> >> >> > 2 -> 3 : 0 | |
> >> >> > 4 -> 7 : 0 | |
> >> >> > 8 -> 15 : 0 | |
> >> >> > 16 -> 31 : 0 | |
> >> >> > 32 -> 63 : 0 | |
> >> >> > 64 -> 127 : 0 | |
> >> >> > 128 -> 255 : 0 | |
> >> >> > 256 -> 511 : 0 | |
> >> >> > 512 -> 1023 : 2419 | |
> >> >> > 1024 -> 2047 : 34499 |* |
> >> >> > 2048 -> 4095 : 4272 | |
> >> >> > 4096 -> 8191 : 9035 | |
> >> >> > 8192 -> 16383 : 4374 | |
> >> >> > 16384 -> 32767 : 2963 | |
> >> >> > 32768 -> 65535 : 6407 | |
> >> >> > 65536 -> 131071 : 884806 |****************************************|
> >> >> > 131072 -> 262143 : 145931 |****** |
> >> >> > 262144 -> 524287 : 13406 | |
> >> >> > 524288 -> 1048575 : 1874 | |
> >> >> > 1048576 -> 2097151 : 249 | |
> >> >> > 2097152 -> 4194303 : 28 | |
> >> >> >
> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > This Intel CPU works fine with the default setting.
> >> >> >
> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> > ==============================================================
> >> >> >
> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> > node 0 only.
> >> >> >
> >> >> > - Default value of 5
> >> >> >
> >> >> > nsecs : count distribution
> >> >> > 0 -> 1 : 0 | |
> >> >> > 2 -> 3 : 0 | |
> >> >> > 4 -> 7 : 0 | |
> >> >> > 8 -> 15 : 0 | |
> >> >> > 16 -> 31 : 0 | |
> >> >> > 32 -> 63 : 0 | |
> >> >> > 64 -> 127 : 0 | |
> >> >> > 128 -> 255 : 0 | |
> >> >> > 256 -> 511 : 46 | |
> >> >> > 512 -> 1023 : 695 | |
> >> >> > 1024 -> 2047 : 19950 |* |
> >> >> > 2048 -> 4095 : 1788 | |
> >> >> > 4096 -> 8191 : 3392 | |
> >> >> > 8192 -> 16383 : 2569 | |
> >> >> > 16384 -> 32767 : 2619 | |
> >> >> > 32768 -> 65535 : 3809 | |
> >> >> > 65536 -> 131071 : 616182 |****************************************|
> >> >> > 131072 -> 262143 : 295587 |******************* |
> >> >> > 262144 -> 524287 : 75357 |**** |
> >> >> > 524288 -> 1048575 : 15471 |* |
> >> >> > 1048576 -> 2097151 : 2939 | |
> >> >> > 2097152 -> 4194303 : 243 | |
> >> >> > 4194304 -> 8388607 : 3 | |
> >> >> >
> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >
> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> > latency exceeding 4ms.
> >> >> >
> >> >> > - Value set to 0
> >> >> >
> >> >> > nsecs : count distribution
> >> >> > 0 -> 1 : 0 | |
> >> >> > 2 -> 3 : 0 | |
> >> >> > 4 -> 7 : 0 | |
> >> >> > 8 -> 15 : 0 | |
> >> >> > 16 -> 31 : 0 | |
> >> >> > 32 -> 63 : 0 | |
> >> >> > 64 -> 127 : 0 | |
> >> >> > 128 -> 255 : 0 | |
> >> >> > 256 -> 511 : 24 | |
> >> >> > 512 -> 1023 : 2686 | |
> >> >> > 1024 -> 2047 : 10246 | |
> >> >> > 2048 -> 4095 : 4061529 |********* |
> >> >> > 4096 -> 8191 : 16894971 |****************************************|
> >> >> > 8192 -> 16383 : 6279310 |************** |
> >> >> > 16384 -> 32767 : 1658240 |*** |
> >> >> > 32768 -> 65535 : 445760 |* |
> >> >> > 65536 -> 131071 : 110817 | |
> >> >> > 131072 -> 262143 : 20279 | |
> >> >> > 262144 -> 524287 : 4176 | |
> >> >> > 524288 -> 1048575 : 436 | |
> >> >> > 1048576 -> 2097151 : 8 | |
> >> >> > 2097152 -> 4194303 : 2 | |
> >> >> >
> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >
> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> > max latency is less than 4ms.
> >> >> >
> >> >> > - Conclusion
> >> >> >
> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> > applications work well with the default setting.
> >> >> >
> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> > kernel.
> >> >> >
> >> >> > Why introduce a sysctl knob?
> >> >> > ============================
> >> >> >
> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >
> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> > setting for better throughput. In our production environment, we set this
> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> > common to release individual kernel packages for each application.
> >> >>
> >> >> Thanks for detailed performance data!
> >> >>
> >> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> >> >> your environment? If not, I suggest to use 0 as default for
> >> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> >> >> that, if someone found some other workloads need larger
> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >>
> >> >
> >> > The decision doesn’t rest with us, the kernel team at our company.
> >> > It’s made by the system administrators who manage a large number of
> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> > servers, not in other environments like big data servers. We have
> >> > informed other system administrators, such as those managing the big
> >> > data servers, about the latency spike issues, but they are unwilling
> >> > to make the change.
> >> >
> >> > No one wants to make changes unless there is evidence showing that the
> >> > old settings will negatively impact them. However, as you know,
> >> > latency is not a critical concern for big data; throughput is more
> >> > important. If we keep the current settings, we will have to release
> >> > different kernel packages for different environments, which is a
> >> > significant burden for us.
> >>
> >> Totally understand your requirements. And, I think that this is better
> >> to be resolved in your downstream kernel. If there are clear evidences
> >> to prove small batch number hurts throughput for some workloads, we can
> >> make the change in the upstream kernel.
> >>
> >
> > Please don't make this more complicated. We are at an impasse.
> >
> > The key issue here is that the upstream kernel has a default value of
> > 5, not 0. If you can change it to 0, we can persuade our users to
> > follow the upstream changes. They currently set it to 5, not because
> > you, the author, chose this value, but because it is the default in
> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> > support it. It's not just your decision as the author, but the entire
> > community supports this default.
> >
> > If, in the future, we find that the value of 0 is not suitable, you'll
> > tell us, "It is an issue in your downstream kernel, not in the
> > upstream kernel, so we won't accept it." PANIC.
>
> I don't think so. I suggest you change the default value to 0. If
> someone reports that their workloads need some other value, then we
> have evidence that different workloads need different values. At that
> time, we can suggest adding a user-tunable knob.
>
The problem is that others are unaware we've set it to 0, and I can't
constantly monitor the linux-mm mailing list. Additionally, it's
possible that you can't always keep an eye on it either.
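Today the only way to tell which value a given host is actually running is
to dig out its build-time config; with the proposed sysctl the setting would
at least be visible at runtime. A small sketch (config file locations vary
by distro):

    # build-time value of the running kernel
    grep CONFIG_PCP_BATCH_SCALE_MAX /boot/config-$(uname -r)
    # or, if CONFIG_IKCONFIG_PROC is enabled
    zgrep CONFIG_PCP_BATCH_SCALE_MAX /proc/config.gz

    # with the knob from this series applied
    cat /proc/sys/vm/pcp_batch_scale_max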
I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
--
Regards
Yafang
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 6:00 ` Yafang Shao
@ 2024-07-29 6:00 ` Huang, Ying
2024-07-29 6:13 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2024-07-29 6:00 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
Yafang Shao <laoar.shao@gmail.com> writes:
> On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Hi, Yafang,
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> >> > in practice.
>> >> >>
>> >> >> As we discussed before [1], I still feel confused about the description
>> >> >> of zone->lock contention. How about changing the description to
>> >> >> something like,
>> >> >
>> >> > Sure, I will change it.
>> >> >
>> >> >>
>> >> >> Larger page allocation/freeing batch number may cause longer run time of
>> >> >> code holding zone->lock. If zone->lock is heavily contended at the same
>> >> >> time, latency spikes may occur even for casual page allocation/freeing.
>> >> >> Although reducing the batch number cannot make zone->lock contended
>> >> >> lighter, it can reduce the latency spikes effectively.
>> >> >>
>> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >> >>
>> >> >> > To demonstrate this, I wrote a Python script:
>> >> >> >
>> >> >> > import mmap
>> >> >> >
>> >> >> > size = 6 * 1024**3
>> >> >> >
>> >> >> > while True:
>> >> >> >     mm = mmap.mmap(-1, size)
>> >> >> >     mm[:] = b'\xff' * size
>> >> >> >     mm.close()
>> >> >> >
>> >> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
>> >> >> > funclatency[1]:
>> >> >> >
>> >> >> > funclatency -T -i 600 rmqueue_bulk
>> >> >> >
>> >> >> > Here are the results for both AMD and Intel CPUs.
>> >> >> >
>> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> >> > =====================================================================
>> >> >> >
>> >> >> > - Default value of 5
>> >> >> >
>> >> >> > nsecs : count distribution
>> >> >> > 0 -> 1 : 0 | |
>> >> >> > 2 -> 3 : 0 | |
>> >> >> > 4 -> 7 : 0 | |
>> >> >> > 8 -> 15 : 0 | |
>> >> >> > 16 -> 31 : 0 | |
>> >> >> > 32 -> 63 : 0 | |
>> >> >> > 64 -> 127 : 0 | |
>> >> >> > 128 -> 255 : 0 | |
>> >> >> > 256 -> 511 : 0 | |
>> >> >> > 512 -> 1023 : 12 | |
>> >> >> > 1024 -> 2047 : 9116 | |
>> >> >> > 2048 -> 4095 : 2004 | |
>> >> >> > 4096 -> 8191 : 2497 | |
>> >> >> > 8192 -> 16383 : 2127 | |
>> >> >> > 16384 -> 32767 : 2483 | |
>> >> >> > 32768 -> 65535 : 10102 | |
>> >> >> > 65536 -> 131071 : 212730 |******************* |
>> >> >> > 131072 -> 262143 : 314692 |***************************** |
>> >> >> > 262144 -> 524287 : 430058 |****************************************|
>> >> >> > 524288 -> 1048575 : 224032 |******************** |
>> >> >> > 1048576 -> 2097151 : 73567 |****** |
>> >> >> > 2097152 -> 4194303 : 17079 |* |
>> >> >> > 4194304 -> 8388607 : 3900 | |
>> >> >> > 8388608 -> 16777215 : 750 | |
>> >> >> > 16777216 -> 33554431 : 88 | |
>> >> >> > 33554432 -> 67108863 : 2 | |
>> >> >> >
>> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >> >
>> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> >> > than 30ms.
>> >> >> >
>> >> >> > - Value set to 0
>> >> >> >
>> >> >> > nsecs : count distribution
>> >> >> > 0 -> 1 : 0 | |
>> >> >> > 2 -> 3 : 0 | |
>> >> >> > 4 -> 7 : 0 | |
>> >> >> > 8 -> 15 : 0 | |
>> >> >> > 16 -> 31 : 0 | |
>> >> >> > 32 -> 63 : 0 | |
>> >> >> > 64 -> 127 : 0 | |
>> >> >> > 128 -> 255 : 0 | |
>> >> >> > 256 -> 511 : 0 | |
>> >> >> > 512 -> 1023 : 92 | |
>> >> >> > 1024 -> 2047 : 8594 | |
>> >> >> > 2048 -> 4095 : 2042818 |****** |
>> >> >> > 4096 -> 8191 : 8737624 |************************** |
>> >> >> > 8192 -> 16383 : 13147872 |****************************************|
>> >> >> > 16384 -> 32767 : 8799951 |************************** |
>> >> >> > 32768 -> 65535 : 2879715 |******** |
>> >> >> > 65536 -> 131071 : 659600 |** |
>> >> >> > 131072 -> 262143 : 204004 | |
>> >> >> > 262144 -> 524287 : 78246 | |
>> >> >> > 524288 -> 1048575 : 30800 | |
>> >> >> > 1048576 -> 2097151 : 12251 | |
>> >> >> > 2097152 -> 4194303 : 2950 | |
>> >> >> > 4194304 -> 8388607 : 78 | |
>> >> >> >
>> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >> >
>> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> >> > to less than 8ms.
>> >> >> >
>> >> >> > - Conclusion
>> >> >> >
>> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >> >
>> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> >> > test it on different AMD models.
>> >> >> >
>> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> >> > ============================================================
>> >> >> >
>> >> >> > - Default value of 5
>> >> >> >
>> >> >> > nsecs : count distribution
>> >> >> > 0 -> 1 : 0 | |
>> >> >> > 2 -> 3 : 0 | |
>> >> >> > 4 -> 7 : 0 | |
>> >> >> > 8 -> 15 : 0 | |
>> >> >> > 16 -> 31 : 0 | |
>> >> >> > 32 -> 63 : 0 | |
>> >> >> > 64 -> 127 : 0 | |
>> >> >> > 128 -> 255 : 0 | |
>> >> >> > 256 -> 511 : 0 | |
>> >> >> > 512 -> 1023 : 2419 | |
>> >> >> > 1024 -> 2047 : 34499 |* |
>> >> >> > 2048 -> 4095 : 4272 | |
>> >> >> > 4096 -> 8191 : 9035 | |
>> >> >> > 8192 -> 16383 : 4374 | |
>> >> >> > 16384 -> 32767 : 2963 | |
>> >> >> > 32768 -> 65535 : 6407 | |
>> >> >> > 65536 -> 131071 : 884806 |****************************************|
>> >> >> > 131072 -> 262143 : 145931 |****** |
>> >> >> > 262144 -> 524287 : 13406 | |
>> >> >> > 524288 -> 1048575 : 1874 | |
>> >> >> > 1048576 -> 2097151 : 249 | |
>> >> >> > 2097152 -> 4194303 : 28 | |
>> >> >> >
>> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >> >
>> >> >> > - Conclusion
>> >> >> >
>> >> >> > This Intel CPU works fine with the default setting.
>> >> >> >
>> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> >> > ==============================================================
>> >> >> >
>> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> >> > node 0 only.
>> >> >> >
>> >> >> > - Default value of 5
>> >> >> >
>> >> >> > nsecs : count distribution
>> >> >> > 0 -> 1 : 0 | |
>> >> >> > 2 -> 3 : 0 | |
>> >> >> > 4 -> 7 : 0 | |
>> >> >> > 8 -> 15 : 0 | |
>> >> >> > 16 -> 31 : 0 | |
>> >> >> > 32 -> 63 : 0 | |
>> >> >> > 64 -> 127 : 0 | |
>> >> >> > 128 -> 255 : 0 | |
>> >> >> > 256 -> 511 : 46 | |
>> >> >> > 512 -> 1023 : 695 | |
>> >> >> > 1024 -> 2047 : 19950 |* |
>> >> >> > 2048 -> 4095 : 1788 | |
>> >> >> > 4096 -> 8191 : 3392 | |
>> >> >> > 8192 -> 16383 : 2569 | |
>> >> >> > 16384 -> 32767 : 2619 | |
>> >> >> > 32768 -> 65535 : 3809 | |
>> >> >> > 65536 -> 131071 : 616182 |****************************************|
>> >> >> > 131072 -> 262143 : 295587 |******************* |
>> >> >> > 262144 -> 524287 : 75357 |**** |
>> >> >> > 524288 -> 1048575 : 15471 |* |
>> >> >> > 1048576 -> 2097151 : 2939 | |
>> >> >> > 2097152 -> 4194303 : 243 | |
>> >> >> > 4194304 -> 8388607 : 3 | |
>> >> >> >
>> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >> >
>> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> >> > node. The average latency is approximately 144us, with the maximum
>> >> >> > latency exceeding 4ms.
>> >> >> >
>> >> >> > - Value set to 0
>> >> >> >
>> >> >> > nsecs : count distribution
>> >> >> > 0 -> 1 : 0 | |
>> >> >> > 2 -> 3 : 0 | |
>> >> >> > 4 -> 7 : 0 | |
>> >> >> > 8 -> 15 : 0 | |
>> >> >> > 16 -> 31 : 0 | |
>> >> >> > 32 -> 63 : 0 | |
>> >> >> > 64 -> 127 : 0 | |
>> >> >> > 128 -> 255 : 0 | |
>> >> >> > 256 -> 511 : 24 | |
>> >> >> > 512 -> 1023 : 2686 | |
>> >> >> > 1024 -> 2047 : 10246 | |
>> >> >> > 2048 -> 4095 : 4061529 |********* |
>> >> >> > 4096 -> 8191 : 16894971 |****************************************|
>> >> >> > 8192 -> 16383 : 6279310 |************** |
>> >> >> > 16384 -> 32767 : 1658240 |*** |
>> >> >> > 32768 -> 65535 : 445760 |* |
>> >> >> > 65536 -> 131071 : 110817 | |
>> >> >> > 131072 -> 262143 : 20279 | |
>> >> >> > 262144 -> 524287 : 4176 | |
>> >> >> > 524288 -> 1048575 : 436 | |
>> >> >> > 1048576 -> 2097151 : 8 | |
>> >> >> > 2097152 -> 4194303 : 2 | |
>> >> >> >
>> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >> >
>> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> >> > max latency is less than 4ms.
>> >> >> >
>> >> >> > - Conclusion
>> >> >> >
>> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> >> > applications work well with the default setting.
>> >> >> >
>> >> >> > It is worth noting that all the above data were tested using the upstream
>> >> >> > kernel.
>> >> >> >
>> >> >> > Why introduce a sysctl knob?
>> >> >> > ============================
>> >> >> >
>> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> > allocation latencies concerning zone->lock contention. Typically, people
>> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
>> >> >> >
>> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> > setting for better throughput. In our production environment, we set this
>> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> >> > at the default value of 5 for other applications like big data. It's not
>> >> >> > common to release individual kernel packages for each application.
>> >> >>
>> >> >> Thanks for detailed performance data!
>> >> >>
>> >> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
>> >> >> your environment? If not, I suggest to use 0 as default for
>> >> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
>> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> >> >> that, if someone found some other workloads need larger
>> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>> >> >>
>> >> >
>> >> > The decision doesn’t rest with us, the kernel team at our company.
>> >> > It’s made by the system administrators who manage a large number of
>> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> >> > servers, not in other environments like big data servers. We have
>> >> > informed other system administrators, such as those managing the big
>> >> > data servers, about the latency spike issues, but they are unwilling
>> >> > to make the change.
>> >> >
>> >> > No one wants to make changes unless there is evidence showing that the
>> >> > old settings will negatively impact them. However, as you know,
>> >> > latency is not a critical concern for big data; throughput is more
>> >> > important. If we keep the current settings, we will have to release
>> >> > different kernel packages for different environments, which is a
>> >> > significant burden for us.
>> >>
>> >> Totally understand your requirements. And, I think that this is better
>> >> to be resolved in your downstream kernel. If there are clear evidences
>> >> to prove small batch number hurts throughput for some workloads, we can
>> >> make the change in the upstream kernel.
>> >>
>> >
>> > Please don't make this more complicated. We are at an impasse.
>> >
>> > The key issue here is that the upstream kernel has a default value of
>> > 5, not 0. If you can change it to 0, we can persuade our users to
>> > follow the upstream changes. They currently set it to 5, not because
>> > you, the author, chose this value, but because it is the default in
>> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
>> > support it. It's not just your decision as the author, but the entire
>> > community supports this default.
>> >
>> > If, in the future, we find that the value of 0 is not suitable, you'll
>> > tell us, "It is an issue in your downstream kernel, not in the
>> > upstream kernel, so we won't accept it." PANIC.
>>
>> I don't think so. I suggest you change the default value to 0. If
>> someone reports that their workloads need some other value, then we
>> have evidence that different workloads need different values. At that
>> time, we can suggest adding a user-tunable knob.
>>
>
> The problem is that others are unaware we've set it to 0, and I can't
> constantly monitor the linux-mm mailing list. Additionally, it's
> possible that you can't always keep an eye on it either.
IIUC, they will use the default value. Then, if there is any
performance regression, they can report it.
> I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 6:00 ` Huang, Ying
@ 2024-07-29 6:13 ` Yafang Shao
2024-07-29 6:14 ` Huang, Ying
0 siblings, 1 reply; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 6:13 UTC (permalink / raw)
To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Hi, Yafang,
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> >> > in practice.
> >> >> >>
> >> >> >> As we discussed before [1], I still feel confused about the description
> >> >> >> of zone->lock contention. How about changing the description to
> >> >> >> something like,
> >> >> >
> >> >> > Sure, I will change it.
> >> >> >
> >> >> >>
> >> >> >> Larger page allocation/freeing batch number may cause longer run time of
> >> >> >> code holding zone->lock. If zone->lock is heavily contended at the same
> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> >> Although reducing the batch number cannot make zone->lock contended
> >> >> >> lighter, it can reduce the latency spikes effectively.
> >> >> >>
> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >> >>
> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >
> >> >> >> > import mmap
> >> >> >> >
> >> >> >> > size = 6 * 1024**3
> >> >> >> >
> >> >> >> > while True:
> >> >> >> >     mm = mmap.mmap(-1, size)
> >> >> >> >     mm[:] = b'\xff' * size
> >> >> >> >     mm.close()
> >> >> >> >
> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> >> > funclatency[1]:
> >> >> >> >
> >> >> >> > funclatency -T -i 600 rmqueue_bulk
> >> >> >> >
> >> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >> >
> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> > =====================================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> > 512 -> 1023 : 12 | |
> >> >> >> > 1024 -> 2047 : 9116 | |
> >> >> >> > 2048 -> 4095 : 2004 | |
> >> >> >> > 4096 -> 8191 : 2497 | |
> >> >> >> > 8192 -> 16383 : 2127 | |
> >> >> >> > 16384 -> 32767 : 2483 | |
> >> >> >> > 32768 -> 65535 : 10102 | |
> >> >> >> > 65536 -> 131071 : 212730 |******************* |
> >> >> >> > 131072 -> 262143 : 314692 |***************************** |
> >> >> >> > 262144 -> 524287 : 430058 |****************************************|
> >> >> >> > 524288 -> 1048575 : 224032 |******************** |
> >> >> >> > 1048576 -> 2097151 : 73567 |****** |
> >> >> >> > 2097152 -> 4194303 : 17079 |* |
> >> >> >> > 4194304 -> 8388607 : 3900 | |
> >> >> >> > 8388608 -> 16777215 : 750 | |
> >> >> >> > 16777216 -> 33554431 : 88 | |
> >> >> >> > 33554432 -> 67108863 : 2 | |
> >> >> >> >
> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >
> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> > than 30ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> > 512 -> 1023 : 92 | |
> >> >> >> > 1024 -> 2047 : 8594 | |
> >> >> >> > 2048 -> 4095 : 2042818 |****** |
> >> >> >> > 4096 -> 8191 : 8737624 |************************** |
> >> >> >> > 8192 -> 16383 : 13147872 |****************************************|
> >> >> >> > 16384 -> 32767 : 8799951 |************************** |
> >> >> >> > 32768 -> 65535 : 2879715 |******** |
> >> >> >> > 65536 -> 131071 : 659600 |** |
> >> >> >> > 131072 -> 262143 : 204004 | |
> >> >> >> > 262144 -> 524287 : 78246 | |
> >> >> >> > 524288 -> 1048575 : 30800 | |
> >> >> >> > 1048576 -> 2097151 : 12251 | |
> >> >> >> > 2097152 -> 4194303 : 2950 | |
> >> >> >> > 4194304 -> 8388607 : 78 | |
> >> >> >> >
> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >
> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> >> > to less than 8ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >> >
> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> >> > test it on different AMD models.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> > ============================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> > 512 -> 1023 : 2419 | |
> >> >> >> > 1024 -> 2047 : 34499 |* |
> >> >> >> > 2048 -> 4095 : 4272 | |
> >> >> >> > 4096 -> 8191 : 9035 | |
> >> >> >> > 8192 -> 16383 : 4374 | |
> >> >> >> > 16384 -> 32767 : 2963 | |
> >> >> >> > 32768 -> 65535 : 6407 | |
> >> >> >> > 65536 -> 131071 : 884806 |****************************************|
> >> >> >> > 131072 -> 262143 : 145931 |****** |
> >> >> >> > 262144 -> 524287 : 13406 | |
> >> >> >> > 524288 -> 1048575 : 1874 | |
> >> >> >> > 1048576 -> 2097151 : 249 | |
> >> >> >> > 2097152 -> 4194303 : 28 | |
> >> >> >> >
> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> > ==============================================================
> >> >> >> >
> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> > node 0 only.
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 46 | |
> >> >> >> > 512 -> 1023 : 695 | |
> >> >> >> > 1024 -> 2047 : 19950 |* |
> >> >> >> > 2048 -> 4095 : 1788 | |
> >> >> >> > 4096 -> 8191 : 3392 | |
> >> >> >> > 8192 -> 16383 : 2569 | |
> >> >> >> > 16384 -> 32767 : 2619 | |
> >> >> >> > 32768 -> 65535 : 3809 | |
> >> >> >> > 65536 -> 131071 : 616182 |****************************************|
> >> >> >> > 131072 -> 262143 : 295587 |******************* |
> >> >> >> > 262144 -> 524287 : 75357 |**** |
> >> >> >> > 524288 -> 1048575 : 15471 |* |
> >> >> >> > 1048576 -> 2097151 : 2939 | |
> >> >> >> > 2097152 -> 4194303 : 243 | |
> >> >> >> > 4194304 -> 8388607 : 3 | |
> >> >> >> >
> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >
> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> >> > latency exceeding 4ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> > nsecs : count distribution
> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> > 256 -> 511 : 24 | |
> >> >> >> > 512 -> 1023 : 2686 | |
> >> >> >> > 1024 -> 2047 : 10246 | |
> >> >> >> > 2048 -> 4095 : 4061529 |********* |
> >> >> >> > 4096 -> 8191 : 16894971 |****************************************|
> >> >> >> > 8192 -> 16383 : 6279310 |************** |
> >> >> >> > 16384 -> 32767 : 1658240 |*** |
> >> >> >> > 32768 -> 65535 : 445760 |* |
> >> >> >> > 65536 -> 131071 : 110817 | |
> >> >> >> > 131072 -> 262143 : 20279 | |
> >> >> >> > 262144 -> 524287 : 4176 | |
> >> >> >> > 524288 -> 1048575 : 436 | |
> >> >> >> > 1048576 -> 2097151 : 8 | |
> >> >> >> > 2097152 -> 4194303 : 2 | |
> >> >> >> >
> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >
> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> > max latency is less than 4ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> > applications work well with the default setting.
> >> >> >> >
> >> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> >> > kernel.
> >> >> >> >
> >> >> >> > Why introduce a sysctl knob?
> >> >> >> > ============================
> >> >> >> >
> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >
> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> > setting for better throughput. In our production environment, we set this
> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> >> > common to release individual kernel packages for each application.
> >> >> >>
> >> >> >> Thanks for detailed performance data!
> >> >> >>
> >> >> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> >> >> >> your environment? If not, I suggest to use 0 as default for
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> >> >> >> that, if someone found some other workloads need larger
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >> >>
> >> >> >
> >> >> > The decision doesn’t rest with us, the kernel team at our company.
> >> >> > It’s made by the system administrators who manage a large number of
> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> >> > servers, not in other environments like big data servers. We have
> >> >> > informed other system administrators, such as those managing the big
> >> >> > data servers, about the latency spike issues, but they are unwilling
> >> >> > to make the change.
> >> >> >
> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> > old settings will negatively impact them. However, as you know,
> >> >> > latency is not a critical concern for big data; throughput is more
> >> >> > important. If we keep the current settings, we will have to release
> >> >> > different kernel packages for different environments, which is a
> >> >> > significant burden for us.
> >> >>
> >> >> Totally understand your requirements. And, I think that this is better
> >> >> to be resolved in your downstream kernel. If there are clear evidences
> >> >> to prove small batch number hurts throughput for some workloads, we can
> >> >> make the change in the upstream kernel.
> >> >>
> >> >
> >> > Please don't make this more complicated. We are at an impasse.
> >> >
> >> > The key issue here is that the upstream kernel has a default value of
> >> > 5, not 0. If you can change it to 0, we can persuade our users to
> >> > follow the upstream changes. They currently set it to 5, not because
> >> > you, the author, chose this value, but because it is the default in
> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> >> > support it. It's not just your decision as the author, but the entire
> >> > community supports this default.
> >> >
> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> > tell us, "It is an issue in your downstream kernel, not in the
> >> > upstream kernel, so we won't accept it." PANIC.
> >>
> >> I don't think so. I suggest you change the default value to 0. If
> >> someone reports that their workloads need some other value, then we
> >> have evidence that different workloads need different values. At that
> >> time, we can suggest adding a user-tunable knob.
> >>
> >
> > The problem is that others are unaware we've set it to 0, and I can't
> > constantly monitor the linux-mm mailing list. Additionally, it's
> > possible that you can't always keep an eye on it either.
>
> IIUC, they will use the default value. Then, if there is any
> performance regression, they can report it.
Now we report it. What is your reply? "Keep it in your downstream
kernel." Wow, PANIC again.
>
> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>
> --
> Best Regards,
> Huang, Ying
--
Regards
Yafang
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 6:13 ` Yafang Shao
@ 2024-07-29 6:14 ` Huang, Ying
2024-07-29 7:50 ` Yafang Shao
0 siblings, 1 reply; 14+ messages in thread
From: Huang, Ying @ 2024-07-29 6:14 UTC (permalink / raw)
To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
Yafang Shao <laoar.shao@gmail.com> writes:
> On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Hi, Yafang,
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >> >>
>> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> >> >> > in practice.
>> >> >> >>
>> >> >> >> As we discussed before [1], I still feel confusing about the description
>> >> >> >> about zone->lock contention. How about change the description to
>> >> >> >> something like,
>> >> >> >
>> >> >> > Sure, I will change it.
>> >> >> >
>> >> >> >>
>> >> >> >> Larger page allocation/freeing batch number may cause longer run time of
>> >> >> >> code holding zone->lock. If zone->lock is heavily contended at the same
>> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
>> >> >> >> Although reducing the batch number cannot make zone->lock contended
>> >> >> >> lighter, it can reduce the latency spikes effectively.
>> >> >> >>
>> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> >> >> >>
>> >> >> >> > To demonstrate this, I wrote a Python script:
>> >> >> >> >
>> >> >> >> > import mmap
>> >> >> >> >
>> >> >> >> > size = 6 * 1024**3
>> >> >> >> >
>> >> >> >> > while True:
>> >> >> >> >     mm = mmap.mmap(-1, size)
>> >> >> >> >     mm[:] = b'\xff' * size
>> >> >> >> >     mm.close()
>> >> >> >> >
>> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
>> >> >> >> > funclatency[1]:
>> >> >> >> >
>> >> >> >> > funclatency -T -i 600 rmqueue_bulk
>> >> >> >> >
>> >> >> >> > Here are the results for both AMD and Intel CPUs.
>> >> >> >> >
>> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> >> >> > =====================================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> > nsecs : count distribution
>> >> >> >> > 0 -> 1 : 0 | |
>> >> >> >> > 2 -> 3 : 0 | |
>> >> >> >> > 4 -> 7 : 0 | |
>> >> >> >> > 8 -> 15 : 0 | |
>> >> >> >> > 16 -> 31 : 0 | |
>> >> >> >> > 32 -> 63 : 0 | |
>> >> >> >> > 64 -> 127 : 0 | |
>> >> >> >> > 128 -> 255 : 0 | |
>> >> >> >> > 256 -> 511 : 0 | |
>> >> >> >> > 512 -> 1023 : 12 | |
>> >> >> >> > 1024 -> 2047 : 9116 | |
>> >> >> >> > 2048 -> 4095 : 2004 | |
>> >> >> >> > 4096 -> 8191 : 2497 | |
>> >> >> >> > 8192 -> 16383 : 2127 | |
>> >> >> >> > 16384 -> 32767 : 2483 | |
>> >> >> >> > 32768 -> 65535 : 10102 | |
>> >> >> >> > 65536 -> 131071 : 212730 |******************* |
>> >> >> >> > 131072 -> 262143 : 314692 |***************************** |
>> >> >> >> > 262144 -> 524287 : 430058 |****************************************|
>> >> >> >> > 524288 -> 1048575 : 224032 |******************** |
>> >> >> >> > 1048576 -> 2097151 : 73567 |****** |
>> >> >> >> > 2097152 -> 4194303 : 17079 |* |
>> >> >> >> > 4194304 -> 8388607 : 3900 | |
>> >> >> >> > 8388608 -> 16777215 : 750 | |
>> >> >> >> > 16777216 -> 33554431 : 88 | |
>> >> >> >> > 33554432 -> 67108863 : 2 | |
>> >> >> >> >
>> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >> >> >
>> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> >> >> > than 30ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> > nsecs : count distribution
>> >> >> >> > 0 -> 1 : 0 | |
>> >> >> >> > 2 -> 3 : 0 | |
>> >> >> >> > 4 -> 7 : 0 | |
>> >> >> >> > 8 -> 15 : 0 | |
>> >> >> >> > 16 -> 31 : 0 | |
>> >> >> >> > 32 -> 63 : 0 | |
>> >> >> >> > 64 -> 127 : 0 | |
>> >> >> >> > 128 -> 255 : 0 | |
>> >> >> >> > 256 -> 511 : 0 | |
>> >> >> >> > 512 -> 1023 : 92 | |
>> >> >> >> > 1024 -> 2047 : 8594 | |
>> >> >> >> > 2048 -> 4095 : 2042818 |****** |
>> >> >> >> > 4096 -> 8191 : 8737624 |************************** |
>> >> >> >> > 8192 -> 16383 : 13147872 |****************************************|
>> >> >> >> > 16384 -> 32767 : 8799951 |************************** |
>> >> >> >> > 32768 -> 65535 : 2879715 |******** |
>> >> >> >> > 65536 -> 131071 : 659600 |** |
>> >> >> >> > 131072 -> 262143 : 204004 | |
>> >> >> >> > 262144 -> 524287 : 78246 | |
>> >> >> >> > 524288 -> 1048575 : 30800 | |
>> >> >> >> > 1048576 -> 2097151 : 12251 | |
>> >> >> >> > 2097152 -> 4194303 : 2950 | |
>> >> >> >> > 4194304 -> 8388607 : 78 | |
>> >> >> >> >
>> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >> >> >
>> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> >> >> > to less than 8ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >> >> >
>> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> >> >> > test it on different AMD models.
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> >> >> > ============================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> > nsecs : count distribution
>> >> >> >> > 0 -> 1 : 0 | |
>> >> >> >> > 2 -> 3 : 0 | |
>> >> >> >> > 4 -> 7 : 0 | |
>> >> >> >> > 8 -> 15 : 0 | |
>> >> >> >> > 16 -> 31 : 0 | |
>> >> >> >> > 32 -> 63 : 0 | |
>> >> >> >> > 64 -> 127 : 0 | |
>> >> >> >> > 128 -> 255 : 0 | |
>> >> >> >> > 256 -> 511 : 0 | |
>> >> >> >> > 512 -> 1023 : 2419 | |
>> >> >> >> > 1024 -> 2047 : 34499 |* |
>> >> >> >> > 2048 -> 4095 : 4272 | |
>> >> >> >> > 4096 -> 8191 : 9035 | |
>> >> >> >> > 8192 -> 16383 : 4374 | |
>> >> >> >> > 16384 -> 32767 : 2963 | |
>> >> >> >> > 32768 -> 65535 : 6407 | |
>> >> >> >> > 65536 -> 131071 : 884806 |****************************************|
>> >> >> >> > 131072 -> 262143 : 145931 |****** |
>> >> >> >> > 262144 -> 524287 : 13406 | |
>> >> >> >> > 524288 -> 1048575 : 1874 | |
>> >> >> >> > 1048576 -> 2097151 : 249 | |
>> >> >> >> > 2097152 -> 4194303 : 28 | |
>> >> >> >> >
>> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > This Intel CPU works fine with the default setting.
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> >> >> > ==============================================================
>> >> >> >> >
>> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> >> >> > node 0 only.
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> > nsecs : count distribution
>> >> >> >> > 0 -> 1 : 0 | |
>> >> >> >> > 2 -> 3 : 0 | |
>> >> >> >> > 4 -> 7 : 0 | |
>> >> >> >> > 8 -> 15 : 0 | |
>> >> >> >> > 16 -> 31 : 0 | |
>> >> >> >> > 32 -> 63 : 0 | |
>> >> >> >> > 64 -> 127 : 0 | |
>> >> >> >> > 128 -> 255 : 0 | |
>> >> >> >> > 256 -> 511 : 46 | |
>> >> >> >> > 512 -> 1023 : 695 | |
>> >> >> >> > 1024 -> 2047 : 19950 |* |
>> >> >> >> > 2048 -> 4095 : 1788 | |
>> >> >> >> > 4096 -> 8191 : 3392 | |
>> >> >> >> > 8192 -> 16383 : 2569 | |
>> >> >> >> > 16384 -> 32767 : 2619 | |
>> >> >> >> > 32768 -> 65535 : 3809 | |
>> >> >> >> > 65536 -> 131071 : 616182 |****************************************|
>> >> >> >> > 131072 -> 262143 : 295587 |******************* |
>> >> >> >> > 262144 -> 524287 : 75357 |**** |
>> >> >> >> > 524288 -> 1048575 : 15471 |* |
>> >> >> >> > 1048576 -> 2097151 : 2939 | |
>> >> >> >> > 2097152 -> 4194303 : 243 | |
>> >> >> >> > 4194304 -> 8388607 : 3 | |
>> >> >> >> >
>> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >> >> >
>> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> >> >> > node. The average latency is approximately 144us, with the maximum
>> >> >> >> > latency exceeding 4ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> > nsecs : count distribution
>> >> >> >> > 0 -> 1 : 0 | |
>> >> >> >> > 2 -> 3 : 0 | |
>> >> >> >> > 4 -> 7 : 0 | |
>> >> >> >> > 8 -> 15 : 0 | |
>> >> >> >> > 16 -> 31 : 0 | |
>> >> >> >> > 32 -> 63 : 0 | |
>> >> >> >> > 64 -> 127 : 0 | |
>> >> >> >> > 128 -> 255 : 0 | |
>> >> >> >> > 256 -> 511 : 24 | |
>> >> >> >> > 512 -> 1023 : 2686 | |
>> >> >> >> > 1024 -> 2047 : 10246 | |
>> >> >> >> > 2048 -> 4095 : 4061529 |********* |
>> >> >> >> > 4096 -> 8191 : 16894971 |****************************************|
>> >> >> >> > 8192 -> 16383 : 6279310 |************** |
>> >> >> >> > 16384 -> 32767 : 1658240 |*** |
>> >> >> >> > 32768 -> 65535 : 445760 |* |
>> >> >> >> > 65536 -> 131071 : 110817 | |
>> >> >> >> > 131072 -> 262143 : 20279 | |
>> >> >> >> > 262144 -> 524287 : 4176 | |
>> >> >> >> > 524288 -> 1048575 : 436 | |
>> >> >> >> > 1048576 -> 2097151 : 8 | |
>> >> >> >> > 2097152 -> 4194303 : 2 | |
>> >> >> >> >
>> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >> >> >
>> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> >> >> > max latency is less than 4ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> >> >> > applications work well with the default setting.
>> >> >> >> >
>> >> >> >> > It is worth noting that all the above data were tested using the upstream
>> >> >> >> > kernel.
>> >> >> >> >
>> >> >> >> > Why introduce a sysctl knob?
>> >> >> >> > ============================
>> >> >> >> >
>> >> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
>> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
>> >> >> >> >
>> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> >> > setting for better throughput. In our production environment, we set this
>> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> >> >> > at the default value of 5 for other applications like big data. It's not
>> >> >> >> > common to release individual kernel packages for each application.
>> >> >> >>
>> >> >> >> Thanks for detailed performance data!
>> >> >> >>
>> >> >> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
>> >> >> >> your environment? If not, I suggest to use 0 as default for
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
>> >> >> >> that, if someone found some other workloads need larger
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>> >> >> >>
>> >> >> >
>> >> >> > The decision doesn’t rest with us, the kernel team at our company.
>> >> >> > It’s made by the system administrators who manage a large number of
>> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> >> >> > servers, not in other environments like big data servers. We have
>> >> >> > informed other system administrators, such as those managing the big
>> >> >> > data servers, about the latency spike issues, but they are unwilling
>> >> >> > to make the change.
>> >> >> >
>> >> >> > No one wants to make changes unless there is evidence showing that the
>> >> >> > old settings will negatively impact them. However, as you know,
>> >> >> > latency is not a critical concern for big data; throughput is more
>> >> >> > important. If we keep the current settings, we will have to release
>> >> >> > different kernel packages for different environments, which is a
>> >> >> > significant burden for us.
>> >> >>
>> >> >> Totally understand your requirements. And, I think that this is better
>> >> >> to be resolved in your downstream kernel. If there are clear evidences
>> >> >> to prove small batch number hurts throughput for some workloads, we can
>> >> >> make the change in the upstream kernel.
>> >> >>
>> >> >
>> >> > Please don't make this more complicated. We are at an impasse.
>> >> >
>> >> > The key issue here is that the upstream kernel has a default value of
>> >> > 5, not 0. If you can change it to 0, we can persuade our users to
>> >> > follow the upstream changes. They currently set it to 5, not because
>> >> > you, the author, chose this value, but because it is the default in
>> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
>> >> > support it. It's not just your decision as the author, but the entire
>> >> > community supports this default.
>> >> >
>> >> > If, in the future, we find that the value of 0 is not suitable, you'll
>> >> > tell us, "It is an issue in your downstream kernel, not in the
>> >> > upstream kernel, so we won't accept it." PANIC.
>> >>
>> >> I don't think so. I suggest you to change the default value to 0. If
>> >> someone reported that his workloads need some other value, then we have
>> >> evidence that different workloads need different value. At that time,
>> >> we can suggest to add an user tunable knob.
>> >>
>> >
>> > The problem is that others are unaware we've set it to 0, and I can't
>> > constantly monitor the linux-mm mailing list. Additionally, it's
>> > possible that you can't always keep an eye on it either.
>>
>> IIUC, they will use the default value. Then, if there is any
>> performance regression, they can report it.
>
> Now we are reporting it. What is your reply? "Keep it in your downstream
> kernel." Wow, PANIC again.
This is not all of my reply. I also suggested that you change the default
value.
>
>>
>> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>>
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
2024-07-29 6:14 ` Huang, Ying
@ 2024-07-29 7:50 ` Yafang Shao
0 siblings, 0 replies; 14+ messages in thread
From: Yafang Shao @ 2024-07-29 7:50 UTC (permalink / raw)
To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes
On Mon, Jul 29, 2024 at 2:18 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Hi, Yafang,
> >> >> >> >>
> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >> >>
> >> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> >> >> > in practice.
> >> >> >> >>
> >> >> >> >> As we discussed before [1], I still feel confusing about the description
> >> >> >> >> about zone->lock contention. How about change the description to
> >> >> >> >> something like,
> >> >> >> >
> >> >> >> > Sure, I will change it.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> Larger page allocation/freeing batch number may cause longer run time of
> >> >> >> >> code holding zone->lock. If zone->lock is heavily contended at the same
> >> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> >> >> Although reducing the batch number cannot make zone->lock contended
> >> >> >> >> lighter, it can reduce the latency spikes effectively.
> >> >> >> >>
> >> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >> >> >> >>
> >> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >> >
> >> >> >> >> > import mmap
> >> >> >> >> >
> >> >> >> >> > size = 6 * 1024**3
> >> >> >> >> >
> >> >> >> >> > while True:
> >> >> >> >> >     mm = mmap.mmap(-1, size)
> >> >> >> >> >     mm[:] = b'\xff' * size
> >> >> >> >> >     mm.close()
> >> >> >> >> >
> >> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> >> >> > funclatency[1]:
> >> >> >> >> >
> >> >> >> >> > funclatency -T -i 600 rmqueue_bulk
> >> >> >> >> >
> >> >> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >> >> >
> >> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> >> > =====================================================================
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> > nsecs : count distribution
> >> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> >> > 512 -> 1023 : 12 | |
> >> >> >> >> > 1024 -> 2047 : 9116 | |
> >> >> >> >> > 2048 -> 4095 : 2004 | |
> >> >> >> >> > 4096 -> 8191 : 2497 | |
> >> >> >> >> > 8192 -> 16383 : 2127 | |
> >> >> >> >> > 16384 -> 32767 : 2483 | |
> >> >> >> >> > 32768 -> 65535 : 10102 | |
> >> >> >> >> > 65536 -> 131071 : 212730 |******************* |
> >> >> >> >> > 131072 -> 262143 : 314692 |***************************** |
> >> >> >> >> > 262144 -> 524287 : 430058 |****************************************|
> >> >> >> >> > 524288 -> 1048575 : 224032 |******************** |
> >> >> >> >> > 1048576 -> 2097151 : 73567 |****** |
> >> >> >> >> > 2097152 -> 4194303 : 17079 |* |
> >> >> >> >> > 4194304 -> 8388607 : 3900 | |
> >> >> >> >> > 8388608 -> 16777215 : 750 | |
> >> >> >> >> > 16777216 -> 33554431 : 88 | |
> >> >> >> >> > 33554432 -> 67108863 : 2 | |
> >> >> >> >> >
> >> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >> >
> >> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> >> > than 30ms.
> >> >> >> >> >
> >> >> >> >> > - Value set to 0
> >> >> >> >> >
> >> >> >> >> > nsecs : count distribution
> >> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> >> > 512 -> 1023 : 92 | |
> >> >> >> >> > 1024 -> 2047 : 8594 | |
> >> >> >> >> > 2048 -> 4095 : 2042818 |****** |
> >> >> >> >> > 4096 -> 8191 : 8737624 |************************** |
> >> >> >> >> > 8192 -> 16383 : 13147872 |****************************************|
> >> >> >> >> > 16384 -> 32767 : 8799951 |************************** |
> >> >> >> >> > 32768 -> 65535 : 2879715 |******** |
> >> >> >> >> > 65536 -> 131071 : 659600 |** |
> >> >> >> >> > 131072 -> 262143 : 204004 | |
> >> >> >> >> > 262144 -> 524287 : 78246 | |
> >> >> >> >> > 524288 -> 1048575 : 30800 | |
> >> >> >> >> > 1048576 -> 2097151 : 12251 | |
> >> >> >> >> > 2097152 -> 4194303 : 2950 | |
> >> >> >> >> > 4194304 -> 8388607 : 78 | |
> >> >> >> >> >
> >> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >> >
> >> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> >> >> > to less than 8ms.
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >> >> >
> >> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> >> >> > test it on different AMD models.
> >> >> >> >> >
> >> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> >> > ============================================================
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> > nsecs : count distribution
> >> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> >> > 256 -> 511 : 0 | |
> >> >> >> >> > 512 -> 1023 : 2419 | |
> >> >> >> >> > 1024 -> 2047 : 34499 |* |
> >> >> >> >> > 2048 -> 4095 : 4272 | |
> >> >> >> >> > 4096 -> 8191 : 9035 | |
> >> >> >> >> > 8192 -> 16383 : 4374 | |
> >> >> >> >> > 16384 -> 32767 : 2963 | |
> >> >> >> >> > 32768 -> 65535 : 6407 | |
> >> >> >> >> > 65536 -> 131071 : 884806 |****************************************|
> >> >> >> >> > 131072 -> 262143 : 145931 |****** |
> >> >> >> >> > 262144 -> 524287 : 13406 | |
> >> >> >> >> > 524288 -> 1048575 : 1874 | |
> >> >> >> >> > 1048576 -> 2097151 : 249 | |
> >> >> >> >> > 2097152 -> 4194303 : 28 | |
> >> >> >> >> >
> >> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >> >
> >> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> >> > ==============================================================
> >> >> >> >> >
> >> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> >> > node 0 only.
> >> >> >> >> >
> >> >> >> >> > - Default value of 5
> >> >> >> >> >
> >> >> >> >> > nsecs : count distribution
> >> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> >> > 256 -> 511 : 46 | |
> >> >> >> >> > 512 -> 1023 : 695 | |
> >> >> >> >> > 1024 -> 2047 : 19950 |* |
> >> >> >> >> > 2048 -> 4095 : 1788 | |
> >> >> >> >> > 4096 -> 8191 : 3392 | |
> >> >> >> >> > 8192 -> 16383 : 2569 | |
> >> >> >> >> > 16384 -> 32767 : 2619 | |
> >> >> >> >> > 32768 -> 65535 : 3809 | |
> >> >> >> >> > 65536 -> 131071 : 616182 |****************************************|
> >> >> >> >> > 131072 -> 262143 : 295587 |******************* |
> >> >> >> >> > 262144 -> 524287 : 75357 |**** |
> >> >> >> >> > 524288 -> 1048575 : 15471 |* |
> >> >> >> >> > 1048576 -> 2097151 : 2939 | |
> >> >> >> >> > 2097152 -> 4194303 : 243 | |
> >> >> >> >> > 4194304 -> 8388607 : 3 | |
> >> >> >> >> >
> >> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >> >
> >> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> >> >> > latency exceeding 4ms.
> >> >> >> >> >
> >> >> >> >> > - Value set to 0
> >> >> >> >> >
> >> >> >> >> > nsecs : count distribution
> >> >> >> >> > 0 -> 1 : 0 | |
> >> >> >> >> > 2 -> 3 : 0 | |
> >> >> >> >> > 4 -> 7 : 0 | |
> >> >> >> >> > 8 -> 15 : 0 | |
> >> >> >> >> > 16 -> 31 : 0 | |
> >> >> >> >> > 32 -> 63 : 0 | |
> >> >> >> >> > 64 -> 127 : 0 | |
> >> >> >> >> > 128 -> 255 : 0 | |
> >> >> >> >> > 256 -> 511 : 24 | |
> >> >> >> >> > 512 -> 1023 : 2686 | |
> >> >> >> >> > 1024 -> 2047 : 10246 | |
> >> >> >> >> > 2048 -> 4095 : 4061529 |********* |
> >> >> >> >> > 4096 -> 8191 : 16894971 |****************************************|
> >> >> >> >> > 8192 -> 16383 : 6279310 |************** |
> >> >> >> >> > 16384 -> 32767 : 1658240 |*** |
> >> >> >> >> > 32768 -> 65535 : 445760 |* |
> >> >> >> >> > 65536 -> 131071 : 110817 | |
> >> >> >> >> > 131072 -> 262143 : 20279 | |
> >> >> >> >> > 262144 -> 524287 : 4176 | |
> >> >> >> >> > 524288 -> 1048575 : 436 | |
> >> >> >> >> > 1048576 -> 2097151 : 8 | |
> >> >> >> >> > 2097152 -> 4194303 : 2 | |
> >> >> >> >> >
> >> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >> >
> >> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> >> > max latency is less than 4ms.
> >> >> >> >> >
> >> >> >> >> > - Conclusion
> >> >> >> >> >
> >> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> >> > applications work well with the default setting.
> >> >> >> >> >
> >> >> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> >> >> > kernel.
> >> >> >> >> >
> >> >> >> >> > Why introduce a sysctl knob?
> >> >> >> >> > ============================
> >> >> >> >> >
> >> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >> >
> >> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> >> > setting for better throughput. In our production environment, we set this
> >> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> >> >> > common to release individual kernel packages for each application.
> >> >> >> >>
> >> >> >> >> Thanks for detailed performance data!
> >> >> >> >>
> >> >> >> >> Is there any downside observed to set CONFIG_PCP_BATCH_SCALE_MAX to 0 in
> >> >> >> >> your environment? If not, I suggest to use 0 as default for
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX. Because we have clear evidence that
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads. After
> >> >> >> >> that, if someone found some other workloads need larger
> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >> >> >>
> >> >> >> >
> >> >> >> > The decision doesn’t rest with us, the kernel team at our company.
> >> >> >> > It’s made by the system administrators who manage a large number of
> >> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> >> >> > servers, not in other environments like big data servers. We have
> >> >> >> > informed other system administrators, such as those managing the big
> >> >> >> > data servers, about the latency spike issues, but they are unwilling
> >> >> >> > to make the change.
> >> >> >> >
> >> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> >> > old settings will negatively impact them. However, as you know,
> >> >> >> > latency is not a critical concern for big data; throughput is more
> >> >> >> > important. If we keep the current settings, we will have to release
> >> >> >> > different kernel packages for different environments, which is a
> >> >> >> > significant burden for us.
> >> >> >>
> >> >> >> Totally understand your requirements. And, I think that this is better
> >> >> >> to be resolved in your downstream kernel. If there are clear evidences
> >> >> >> to prove small batch number hurts throughput for some workloads, we can
> >> >> >> make the change in the upstream kernel.
> >> >> >>
> >> >> >
> >> >> > Please don't make this more complicated. We are at an impasse.
> >> >> >
> >> >> > The key issue here is that the upstream kernel has a default value of
> >> >> > 5, not 0. If you can change it to 0, we can persuade our users to
> >> >> > follow the upstream changes. They currently set it to 5, not because
> >> >> > you, the author, chose this value, but because it is the default in
> >> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> >> >> > support it. It's not just your decision as the author, but the entire
> >> >> > community supports this default.
> >> >> >
> >> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> >> > tell us, "It is an issue in your downstream kernel, not in the
> >> >> > upstream kernel, so we won't accept it." PANIC.
> >> >>
> >> >> I don't think so. I suggest you to change the default value to 0. If
> >> >> someone reported that his workloads need some other value, then we have
> >> >> evidence that different workloads need different value. At that time,
> >> >> we can suggest to add an user tunable knob.
> >> >>
> >> >
> >> > The problem is that others are unaware we've set it to 0, and I can't
> >> > constantly monitor the linux-mm mailing list. Additionally, it's
> >> > possible that you can't always keep an eye on it either.
> >>
> >> IIUC, they will use the default value. Then, if there is any
> >> performance regression, they can report it.
> >
> > Now we are reporting it. What is your reply? "Keep it in your downstream
> > kernel." Wow, PANIC again.
>
> This is not all of my reply. I also suggested that you change the default
> value.
For the upstream kernel, I don't have a strong justification for changing
the default value from 5 to 0. That's why I'm proposing to introduce a
sysctl knob instead.
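Purely as an illustration (a minimal sketch, assuming the series is applied
and the knob shows up under the proposed name as
/proc/sys/vm/pcp_batch_scale_max), the policy could then be picked from user
space at runtime instead of at kernel build time:

    import pathlib

    # Hypothetical path; it only exists if this series is merged.
    KNOB = pathlib.Path("/proc/sys/vm/pcp_batch_scale_max")

    def set_pcp_batch_scale_max(value: int) -> None:
        # Requires root. 0 clamps the PCP batch scale to its minimum,
        # trading some batching efficiency for lower allocation tail
        # latency, as in the measurements above.
        KNOB.write_text(f"{value}\n")

    set_pcp_batch_scale_max(0)  # e.g. on latency-sensitive k8s nodes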
For our downstream kernel, some system administrators want us to keep
this value the same as upstream, because their workloads run fine with
the old default value of 5. Therefore, we can't simply set the default
value in our downstream kernel to 0 either.
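With the knob in place, the split we already use in production could be kept
as plain per-role configuration rather than per-environment kernel packages
(again only a sketch under the proposed knob name; the drop-in file names are
made up):

    # /etc/sysctl.d/90-k8s.conf      (latency-sensitive Kubernetes nodes)
    vm.pcp_batch_scale_max = 0

    # /etc/sysctl.d/90-bigdata.conf  (throughput-oriented big data nodes,
    #                                 i.e. keep the current default of 5)
    vm.pcp_batch_scale_max = 5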
Let's wait for Andrew's suggestion.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2024-07-29 7:51 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2024-07-29 2:35 [PATCH v2 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-29 2:35 ` [PATCH v2 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
2024-07-29 2:35 ` [PATCH v2 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
2024-07-29 2:35 ` [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
2024-07-29 3:18 ` Huang, Ying
2024-07-29 3:40 ` Yafang Shao
2024-07-29 5:12 ` Huang, Ying
2024-07-29 5:45 ` Yafang Shao
2024-07-29 5:50 ` Huang, Ying
2024-07-29 6:00 ` Yafang Shao
2024-07-29 6:00 ` Huang, Ying
2024-07-29 6:13 ` Yafang Shao
2024-07-29 6:14 ` Huang, Ying
2024-07-29 7:50 ` Yafang Shao