* [PATCH v3 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max
@ 2024-08-04  8:01 Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Yafang Shao @ 2024-08-04  8:01 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

Background
==========

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. The
workload is organized as separate processes rather than threads because
the Python Global Interpreter Lock (GIL) is a bottleneck in a
multi-threaded setup. When these containers exit, other containers
hosted on the same machine experience significant latency spikes.

Investigation
=============

During my investigation of this issue, I found that the latency spikes
were caused by zone->lock contention. This can be illustrated as follows:

   CPU A (Freer)                 CPU B (Allocator)
  lock zone->lock
  free pages                      lock zone->lock
  unlock zone->lock               
                                  alloc pages
                                  unlock zone->lock

If the Freer holds zone->lock for an extended period, the Allocator has
to wait, and thus latency spikes occur.

I also wrote a Python script to reproduce the issue on my test servers;
see the details in patch #3. It is worth noting that the reproducer runs
on the upstream kernel.

Experimenting
=============

The more pages are freed in one batch, the longer the lock is held. My
attempt therefore involves reducing the batch size. After restricting the
batch to the smallest size, there were no more complaints about latency
spikes.

However, during my experiments I found that CONFIG_PCP_BATCH_SCALE_MAX
is hard to use in practice, so I try to improve it in this series; see
the sketch below.
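
For reference, a minimal sketch of the two tuning paths (the Kconfig
rebuild is the current mechanism; the runtime sysctl is what this series
proposes):

  # current mechanism: rebuild the kernel with a smaller scale factor,
  # e.g. CONFIG_PCP_BATCH_SCALE_MAX=0 in .config
  # with this series: switch at runtime, no rebuild or reboot needed
  sysctl -w vm.pcp_batch_scale_max=0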

The Proposal
============

This series encompasses two minor refinements to the PCP high watermark
auto-tuning mechanism, along with the introduction of a new sysctl knob
that serves as a more practical alternative to the previous configuration
method.

Future work
===========

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[0], while another entails splitting
zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[1]. However, implementing
these solutions is likely to require a more extended development effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]

Changes:
- v2->v3: 
  - commit log refinement
  - rebase it on mm-everything

- v1-> v2: https://lwn.net/Articles/983837/
  Commit log refinement

- v1: mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  https://lwn.net/Articles/981069/

- mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the
  minimum pagelist
  https://lore.kernel.org/linux-mm/20240701142046.6050-1-laoar.shao@gmail.com/

Yafang Shao (3):
  mm/page_alloc: A minor fix to the calculation of pcp->free_count
  mm/page_alloc: Avoid changing pcp->high decaying when adjusting
    CONFIG_PCP_BATCH_SCALE_MAX
  mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++
 mm/Kconfig                              | 11 -------
 mm/page_alloc.c                         | 40 ++++++++++++++++++-------
 3 files changed, 47 insertions(+), 21 deletions(-)

-- 
2.34.1




* [PATCH v3 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count
  2024-08-04  8:01 [PATCH v3 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-08-04  8:01 ` Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-08-04  8:01 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

At worst, pcp->free_count can currently reach
(batch - 1 + (1 << MAX_ORDER)), which may exceed its expected maximum of
(batch << CONFIG_PCP_BATCH_SCALE_MAX).

This issue was identified through code review, and no real problems have
been observed.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47c0ce1d6fa1..d9371806f6b5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2614,7 +2614,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
-		pcp->free_count += (1 << order);
+		pcp->free_count = min(pcp->free_count + (1 << order),
+				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
-- 
2.34.1




* [PATCH v3 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX
  2024-08-04  8:01 [PATCH v3 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
@ 2024-08-04  8:01 ` Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-08-04  8:01 UTC (permalink / raw)
  To: akpm; +Cc: ying.huang, mgorman, linux-mm, Yafang Shao

When adjusting CONFIG_PCP_BATCH_SCALE_MAX from its default value of 5 to
a lower value, such as 0, it's important to ensure that the pcp->high
decay is not inadvertently slowed down. Similarly, when increasing
CONFIG_PCP_BATCH_SCALE_MAX to a larger value, like 6, we must avoid
inadvertently increasing the number of pages freed in
free_pcppages_bulk() as a result of this change.

So the following improvements are made:
- hardcode the value 5 in the pcp->high decay path so that changing
  CONFIG_PCP_BATCH_SCALE_MAX does not modify the decay rate
- call free_pcppages_bulk() in multiple batch-sized steps

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/page_alloc.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9371806f6b5..5a842cc13314 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2323,7 +2323,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
 	int high_min, to_drain, batch;
-	int todo = 0;
+	int todo = 0, count = 0;
 
 	high_min = READ_ONCE(pcp->high_min);
 	batch = READ_ONCE(pcp->batch);
@@ -2333,18 +2333,26 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 	 * control latency.  This caps pcp->high decrement too.
 	 */
 	if (pcp->high > high_min) {
-		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+		/*
+		 * We will generally decay pcp->high by 1/8 each time, so that
+		 * idle PCP pages can be returned to the buddy system in a
+		 * timely manner. To control the max latency of the decay, we
+		 * also constrain the number of pages freed each time.
+		 */
+		pcp->high = max3(pcp->count - (batch << 5),
 				 pcp->high - (pcp->high >> 3), high_min);
 		if (pcp->high > high_min)
 			todo++;
 	}
 
 	to_drain = pcp->count - pcp->high;
-	if (to_drain > 0) {
+	while (count < to_drain) {
 		spin_lock(&pcp->lock);
-		free_pcppages_bulk(zone, to_drain, pcp, 0);
+		free_pcppages_bulk(zone, min(batch, to_drain - count), pcp, 0);
 		spin_unlock(&pcp->lock);
+		count += batch;
 		todo++;
+		cond_resched();
 	}
 
 	return todo;
-- 
2.34.1




* [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-04  8:01 [PATCH v3 0/3] mm: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 1/3] mm/page_alloc: A minor fix to the calculation of pcp->free_count Yafang Shao
  2024-08-04  8:01 ` [PATCH v3 2/3] mm/page_alloc: Avoid changing pcp->high decaying when adjusting CONFIG_PCP_BATCH_SCALE_MAX Yafang Shao
@ 2024-08-04  8:01 ` Yafang Shao
  2024-08-05  1:38   ` Huang, Ying
  2 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-08-04  8:01 UTC (permalink / raw)
  To: akpm
  Cc: ying.huang, mgorman, linux-mm, Yafang Shao, Matthew Wilcox,
	David Rientjes

A larger page allocation/freeing batch number may cause the code holding
zone->lock to run longer.  If zone->lock is heavily contended at the same
time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make the zone->lock contention
itself lighter, it can reduce the latency spikes effectively.

To demonstrate this, I wrote a Python script:

  import mmap

  size = 6 * 1024**3

  while True:
      mm = mmap.mmap(-1, size)
      mm[:] = b'\xff' * size
      mm.close()

Run this script 10 times in parallel and measure the allocation latency
by timing the duration of rmqueue_bulk() with the BCC tool
funclatency [0]:

  funclatency -T -i 600 rmqueue_bulk
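
As a minimal sketch, the ten parallel instances can be launched as
follows (assuming the script above is saved as alloc_churn.py; the file
name is only an example):

  for i in $(seq 10); do
      python3 alloc_churn.py &
  done
  wait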

Here are the results for both AMD and Intel CPUs.

AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 12       |                                        |
      1024 -> 2047       : 9116     |                                        |
      2048 -> 4095       : 2004     |                                        |
      4096 -> 8191       : 2497     |                                        |
      8192 -> 16383      : 2127     |                                        |
     16384 -> 32767      : 2483     |                                        |
     32768 -> 65535      : 10102    |                                        |
     65536 -> 131071     : 212730   |*******************                     |
    131072 -> 262143     : 314692   |*****************************           |
    262144 -> 524287     : 430058   |****************************************|
    524288 -> 1048575    : 224032   |********************                    |
   1048576 -> 2097151    : 73567    |******                                  |
   2097152 -> 4194303    : 17079    |*                                       |
   4194304 -> 8388607    : 3900     |                                        |
   8388608 -> 16777215   : 750      |                                        |
  16777216 -> 33554431   : 88       |                                        |
  33554432 -> 67108863   : 2        |                                        |

avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242

The avg alloc latency can be 449us, and the max latency can be higher
than 30ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 92       |                                        |
      1024 -> 2047       : 8594     |                                        |
      2048 -> 4095       : 2042818  |******                                  |
      4096 -> 8191       : 8737624  |**************************              |
      8192 -> 16383      : 13147872 |****************************************|
     16384 -> 32767      : 8799951  |**************************              |
     32768 -> 65535      : 2879715  |********                                |
     65536 -> 131071     : 659600   |**                                      |
    131072 -> 262143     : 204004   |                                        |
    262144 -> 524287     : 78246    |                                        |
    524288 -> 1048575    : 30800    |                                        |
   1048576 -> 2097151    : 12251    |                                        |
   2097152 -> 4194303    : 2950     |                                        |
   4194304 -> 8388607    : 78       |                                        |

avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636

The avg latency was reduced significantly to 19us, and the max latency
was reduced to less than 8ms.

- Conclusion

On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.

However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 2419     |                                        |
      1024 -> 2047       : 34499    |*                                       |
      2048 -> 4095       : 4272     |                                        |
      4096 -> 8191       : 9035     |                                        |
      8192 -> 16383      : 4374     |                                        |
     16384 -> 32767      : 2963     |                                        |
     32768 -> 65535      : 6407     |                                        |
     65536 -> 131071     : 884806   |****************************************|
    131072 -> 262143     : 145931   |******                                  |
    262144 -> 524287     : 13406    |                                        |
    524288 -> 1048575    : 1874     |                                        |
   1048576 -> 2097151    : 249      |                                        |
   2097152 -> 4194303    : 28       |                                        |

avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263

- Conclusion

This Intel CPU works fine with the default setting.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================

Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.
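
A rough sketch of this restriction, assuming the legacy (v1) cpuset
hierarchy is mounted at /sys/fs/cgroup/cpuset (the cgroup name pcp_test
and the CPU list are examples only):

  mkdir /sys/fs/cgroup/cpuset/pcp_test
  # allow only NUMA node 0 and its CPUs
  echo 0    > /sys/fs/cgroup/cpuset/pcp_test/cpuset.mems
  echo 0-23 > /sys/fs/cgroup/cpuset/pcp_test/cpuset.cpus
  # move the current shell into the cpuset, then start the test script
  echo $$   > /sys/fs/cgroup/cpuset/pcp_test/tasks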

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 46       |                                        |
       512 -> 1023       : 695      |                                        |
      1024 -> 2047       : 19950    |*                                       |
      2048 -> 4095       : 1788     |                                        |
      4096 -> 8191       : 3392     |                                        |
      8192 -> 16383      : 2569     |                                        |
     16384 -> 32767      : 2619     |                                        |
     32768 -> 65535      : 3809     |                                        |
     65536 -> 131071     : 616182   |****************************************|
    131072 -> 262143     : 295587   |*******************                     |
    262144 -> 524287     : 75357    |****                                    |
    524288 -> 1048575    : 15471    |*                                       |
   1048576 -> 2097151    : 2939     |                                        |
   2097152 -> 4194303    : 243      |                                        |
   4194304 -> 8388607    : 3        |                                        |

avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651

The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum
latency exceeding 4ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 24       |                                        |
       512 -> 1023       : 2686     |                                        |
      1024 -> 2047       : 10246    |                                        |
      2048 -> 4095       : 4061529  |*********                               |
      4096 -> 8191       : 16894971 |****************************************|
      8192 -> 16383      : 6279310  |**************                          |
     16384 -> 32767      : 1658240  |***                                     |
     32768 -> 65535      : 445760   |*                                       |
     65536 -> 131071     : 110817   |                                        |
    131072 -> 262143     : 20279    |                                        |
    262144 -> 524287     : 4176     |                                        |
    524288 -> 1048575    : 436      |                                        |
   1048576 -> 2097151    : 8        |                                        |
   2097152 -> 4194303    : 2        |                                        |

avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508

After setting it to 0, the avg latency is reduced to around 8us, and the
max latency is less than 4ms.

- Conclusion

On this Intel CPU, this tuning doesn't help much. Latency-sensitive
applications work well with the default setting.

It is worth noting that all of the above data were collected on the
upstream kernel.

Why introduce a sysctl knob?
============================

From the above data, it's clear that different CPU types show varying
allocation latencies under zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.

Furthermore, for latency-insensitive applications, we can keep the
default setting for better throughput. In our production environment, we
set this value to 0 for applications running on Kubernetes servers while
keeping it at the default value of 5 for other applications like big
data; a sketch of this split is shown below. It's not common to release
individual kernel packages for each application.
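
For illustration only (the file name and the node classification are
site-specific assumptions), the per-host override can be applied with a
sysctl.d drop-in on the latency-sensitive nodes:

  # Kubernetes (latency-sensitive) nodes
  echo 'vm.pcp_batch_scale_max = 0' > /etc/sysctl.d/90-pcp-batch.conf
  sysctl --system
  # big data nodes keep the default of 5, i.e. no override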

Future work
===========

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[1], while another entails splitting
zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[2]. However, implementing
these solutions is likely to require a more extended development effort.

Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [0]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [1]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [2]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
 mm/Kconfig                              | 11 -----------
 mm/page_alloc.c                         | 23 +++++++++++++++++------
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..4971289dfb79 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- pcp_batch_scale_max
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -883,6 +884,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+pcp_batch_scale_max
+===================
+
+In the page allocator, PCP (Per-CPU pageset) is refilled and drained in
+batches.  The batch number is scaled automatically to improve page
+allocation/freeing throughput.  But too large a scale factor may hurt
+latency.  This option sets the upper limit of the scale factor to limit
+the maximum latency.
+
+The range for this parameter spans from 0 to 6, with a default value of 5.
+The value assigned to 'N' signifies that during each refilling or draining
+process, a maximum of (batch << N) pages will be involved, where "batch"
+represents the default batch size automatically computed by the kernel for
+each zone.
+
+
 percpu_pagelist_high_fraction
 =============================
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 7b716ac80272..14f64b4f744a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -690,17 +690,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
-config PCP_BATCH_SCALE_MAX
-	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
-	default 5
-	range 0 6
-	help
-	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
-	  batches.  The batch number is scaled automatically to improve page
-	  allocation/free throughput.  But too large scale factor may hurt
-	  latency.  This option sets the upper limit of scale factor to limit
-	  the maximum latency.
-
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5a842cc13314..bf0c94a0b659 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,8 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int pcp_batch_scale_max = 5;
+static int sysctl_6 = 6;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -2391,7 +2393,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 		count = pcp->count;
 		if (count) {
 			int to_drain = min(count,
-				pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+				pcp->batch << pcp_batch_scale_max);
 
 			free_pcppages_bulk(zone, to_drain, pcp, 0);
 			count -= to_drain;
@@ -2519,7 +2521,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 
 	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		return min(pcp->count, batch << pcp_batch_scale_max);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2551,7 +2553,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return 0;
 
 	if (unlikely(free_high)) {
-		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+		pcp->high = max(high - (batch << pcp_batch_scale_max),
 				high_min);
 		return 0;
 	}
@@ -2621,9 +2623,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+	if (pcp->free_count < (batch << pcp_batch_scale_max))
 		pcp->free_count = min(pcp->free_count + (1 << order),
-				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
+				      batch << pcp_batch_scale_max);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2964,7 +2966,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 		 * subsequent allocation of order-0 pages without any freeing.
 		 */
 		if (batch <= max_nr_alloc &&
-		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+		    pcp->alloc_factor < pcp_batch_scale_max)
 			pcp->alloc_factor++;
 		batch = min(batch, max_nr_alloc);
 	}
@@ -6341,6 +6343,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
 		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pcp_batch_scale_max",
+		.data		= &pcp_batch_scale_max,
+		.maxlen		= sizeof(pcp_batch_scale_max),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &sysctl_6,
+	},
 	{
 		.procname	= "lowmem_reserve_ratio",
 		.data		= &sysctl_lowmem_reserve_ratio,
-- 
2.34.1




* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-04  8:01 ` [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max Yafang Shao
@ 2024-08-05  1:38   ` Huang, Ying
  2024-08-05  1:58     ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, Ying @ 2024-08-05  1:38 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

[snip]

>
> Why introduce a systl knob?
> ===========================
>
> From the above data, it's clear that different CPU types have varying
> allocation latencies concerning zone->lock contention. Typically, people
> don't release individual kernel packages for each type of x86_64 CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput.

Do you have any data to prove that the default setting is better for
throughput?  If so, that will be a strong support for your patch.

> In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.
>

[snip]

--
Best Regards,
Huang, Ying



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  1:38   ` Huang, Ying
@ 2024-08-05  1:58     ` Yafang Shao
  2024-08-05  3:02       ` Huang, Ying
  0 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-08-05  1:58 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> [snip]
>
> >
> > Why introduce a systl knob?
> > ===========================
> >
> > From the above data, it's clear that different CPU types have varying
> > allocation latencies concerning zone->lock contention. Typically, people
> > don't release individual kernel packages for each type of x86_64 CPU.
> >
> > Furthermore, for latency-insensitive applications, we can keep the default
> > setting for better throughput.
>
> Do you have any data to prove that the default setting is better for
> throughput?  If so, that will be a strong support for your patch.

No, I don't. The primary reason we can't change the default value from
5 to 0 across our fleet of servers is that you initially set it to 5.
The sysadmins believe you had a strong reason for setting it to 5 by
default; otherwise, it would be considered careless for the upstream
kernel. I also believe you must have had a solid justification for
setting the default value to 5; otherwise, why would you have
submitted your patches?

>
> > In our production environment, we set this
> > value to 0 for applications running on Kubernetes servers while keeping it
> > at the default value of 5 for other applications like big data. It's not
> > common to release individual kernel packages for each application.
> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying



--
Regards
Yafang



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  1:58     ` Yafang Shao
@ 2024-08-05  3:02       ` Huang, Ying
  2024-08-05  3:17         ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, Ying @ 2024-08-05  3:02 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> [snip]
>>
>> >
>> > Why introduce a systl knob?
>> > ===========================
>> >
>> > From the above data, it's clear that different CPU types have varying
>> > allocation latencies concerning zone->lock contention. Typically, people
>> > don't release individual kernel packages for each type of x86_64 CPU.
>> >
>> > Furthermore, for latency-insensitive applications, we can keep the default
>> > setting for better throughput.
>>
>> Do you have any data to prove that the default setting is better for
>> throughput?  If so, that will be a strong support for your patch.
>
> No, I don't. The primary reason we can't change the default value from
> 5 to 0 across our fleet of servers is that you initially set it to 5.
> The sysadmins believe you had a strong reason for setting it to 5 by
> default; otherwise, it would be considered careless for the upstream
> kernel. I also believe you must have had a solid justification for
> setting the default value to 5; otherwise, why would you have
> submitted your patches?

In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
avoid too long latency"), I tried my best to run tests on the machines
available with a micro-benchmark (will-it-scale/page_fault1) which
exercises the kernel page allocator heavily.  From the data in that
commit, a larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little,
but not much.  The 99% alloc/free latency can be kept within about 100us
with CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as the default
value.

But, we can always improve the default value with more data, on more
types of machines and with more types of benchmarks, etc.

Your data suggest a smaller default value, because you have data showing
that the larger default value causes latency spikes (as large as tens of
ms) for some practical workloads, which weren't tested previously.  In
contrast, we don't have strong data to show the throughput advantages of
a larger CONFIG_PCP_BATCH_SCALE_MAX value.

So, I suggest using a smaller default value for
CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more tests to check the
data for 1, 2, 3, and 4, in addition to 0 and 5, to determine the best
choice.

>>
>> > In our production environment, we set this
>> > value to 0 for applications running on Kubernetes servers while keeping it
>> > at the default value of 5 for other applications like big data. It's not
>> > common to release individual kernel packages for each application.
>> >
>>
>> [snip]
>>

--
Best Regards,
Huang, Ying



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  3:02       ` Huang, Ying
@ 2024-08-05  3:17         ` Yafang Shao
  2024-08-05  4:32           ` Huang, Ying
  0 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-08-05  3:17 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Why introduce a systl knob?
> >> > ===========================
> >> >
> >> > From the above data, it's clear that different CPU types have varying
> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >
> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> > setting for better throughput.
> >>
> >> Do you have any data to prove that the default setting is better for
> >> throughput?  If so, that will be a strong support for your patch.
> >
> > No, I don't. The primary reason we can't change the default value from
> > 5 to 0 across our fleet of servers is that you initially set it to 5.
> > The sysadmins believe you had a strong reason for setting it to 5 by
> > default; otherwise, it would be considered careless for the upstream
> > kernel. I also believe you must have had a solid justification for
> > setting the default value to 5; otherwise, why would you have
> > submitted your patches?
>
> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
> avoid too long latency"), I tried my best to run test on the machines
> available with a micro-benchmark (will-it-scale/page_fault1) which
> exercises kernel page allocator heavily.  From the data in commit,
> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not
> much.  The 99% alloc/free latency can be kept within about 100us with
> CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as default value.
>
> But, we can always improve the default value with more data, on more
> types of machines and with more types of benchmarks, etc.
>
> Your data suggest smaller default value because you have data to show
> that larger default value has the latency spike issue (as large as tens
> ms) for some practical workloads.  Which weren't tested previously.  In
> contrast, we don't have strong data to show the throughput advantages of
> larger CONFIG_PCP_BATCH_SCALE_MAX value.
>
> So, I suggest to use a smaller default value for
> CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more test to check the
> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best
> choice.

Which smaller default value would be better? How can we ensure that
other workloads, which we haven't tested, will work well with this new
default value? If you have a better default value in mind, would you
consider sending a patch for it? I would be happy to test it with my
test case.


--
Regards
Yafang



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  3:17         ` Yafang Shao
@ 2024-08-05  4:32           ` Huang, Ying
  2024-08-05  4:48             ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, Ying @ 2024-08-05  4:32 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> [snip]
>> >>
>> >> >
>> >> > Why introduce a systl knob?
>> >> > ===========================
>> >> >
>> >> > From the above data, it's clear that different CPU types have varying
>> >> > allocation latencies concerning zone->lock contention. Typically, people
>> >> > don't release individual kernel packages for each type of x86_64 CPU.
>> >> >
>> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> > setting for better throughput.
>> >>
>> >> Do you have any data to prove that the default setting is better for
>> >> throughput?  If so, that will be a strong support for your patch.
>> >
>> > No, I don't. The primary reason we can't change the default value from
>> > 5 to 0 across our fleet of servers is that you initially set it to 5.
>> > The sysadmins believe you had a strong reason for setting it to 5 by
>> > default; otherwise, it would be considered careless for the upstream
>> > kernel. I also believe you must have had a solid justification for
>> > setting the default value to 5; otherwise, why would you have
>> > submitted your patches?
>>
>> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
>> avoid too long latency"), I tried my best to run test on the machines
>> available with a micro-benchmark (will-it-scale/page_fault1) which
>> exercises kernel page allocator heavily.  From the data in commit,
>> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not
>> much.  The 99% alloc/free latency can be kept within about 100us with
>> CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as default value.
>>
>> But, we can always improve the default value with more data, on more
>> types of machines and with more types of benchmarks, etc.
>>
>> Your data suggest smaller default value because you have data to show
>> that larger default value has the latency spike issue (as large as tens
>> ms) for some practical workloads.  Which weren't tested previously.  In
>> contrast, we don't have strong data to show the throughput advantages of
>> larger CONFIG_PCP_BATCH_SCALE_MAX value.
>>
>> So, I suggest to use a smaller default value for
>> CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more test to check the
>> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best
>> choice.
>
> Which smaller default value would be better?

This depends on further test results.

> How can we ensure that other workloads, which we haven't tested, will
> work well with this new default value?

We cannot.  We can only depend on the data available.  If there are
new data available in the future, we can make the change accordingly.

> If you have a better default value in mind, would you consider sending
> a patch for it? I would be happy to test it with my test case.

If you can test the values 1, 2, 3, and 4 with your workload, that will
be very helpful!  Both allocation latency and total free time (if
possible) are valuable; a rough sketch of such a sweep is below.
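
As an assumption-laden example (it requires the sysctl knob from this
series to be applied; without it, each value needs a rebuild with the
corresponding CONFIG_PCP_BATCH_SCALE_MAX):

  for v in 0 1 2 3 4 5; do
      sysctl -w vm.pcp_batch_scale_max=$v
      # same measurement as in the patch description, one 600s interval
      timeout 610 funclatency -T -i 600 rmqueue_bulk | tee scale_$v.txt
  done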

--
Best Regards,
Huang, Ying



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  4:32           ` Huang, Ying
@ 2024-08-05  4:48             ` Yafang Shao
  2024-08-05  5:00               ` Huang, Ying
  0 siblings, 1 reply; 12+ messages in thread
From: Yafang Shao @ 2024-08-05  4:48 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >
> >> >> > Why introduce a systl knob?
> >> >> > ===========================
> >> >> >
> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >
> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> > setting for better throughput.
> >> >>
> >> >> Do you have any data to prove that the default setting is better for
> >> >> throughput?  If so, that will be a strong support for your patch.
> >> >
> >> > No, I don't. The primary reason we can't change the default value from
> >> > 5 to 0 across our fleet of servers is that you initially set it to 5.
> >> > The sysadmins believe you had a strong reason for setting it to 5 by
> >> > default; otherwise, it would be considered careless for the upstream
> >> > kernel. I also believe you must have had a solid justification for
> >> > setting the default value to 5; otherwise, why would you have
> >> > submitted your patches?
> >>
> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
> >> avoid too long latency"), I tried my best to run test on the machines
> >> available with a micro-benchmark (will-it-scale/page_fault1) which
> >> exercises kernel page allocator heavily.  From the data in commit,
> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not
> >> much.  The 99% alloc/free latency can be kept within about 100us with
> >> CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as default value.
> >>
> >> But, we can always improve the default value with more data, on more
> >> types of machines and with more types of benchmarks, etc.
> >>
> >> Your data suggest smaller default value because you have data to show
> >> that larger default value has the latency spike issue (as large as tens
> >> ms) for some practical workloads.  Which weren't tested previously.  In
> >> contrast, we don't have strong data to show the throughput advantages of
> >> larger CONFIG_PCP_BATCH_SCALE_MAX value.
> >>
> >> So, I suggest to use a smaller default value for
> >> CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more test to check the
> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best
> >> choice.
> >
> > Which smaller default value would be better?
>
> This depends on further test results.

I believe you agree with me that you can't test all workloads.

>
> > How can we ensure that other workloads, which we haven't tested, will
> > work well with this new default value?
>
> We cannot.  We can only depends on the data available.  If there are
> new data available in the future, we can make the change accordingly.

So, your solution is to change the hardcoded value for untested
workloads and then release the kernel package again?

>
> > If you have a better default value in mind, would you consider sending
> > a patch for it? I would be happy to test it with my test case.
>
> If you can test the value 1, 2, 3, and 4 with your workload, that will
> be very helpful!  Both allocation latency and total free time (if
> possible) are valuable.

You know I can't verify it with all workloads, right?
You have so much data to verify, which indicates uncertainty about any
default value. Why not make it tunable and let the user choose the
value they prefer?

-- 
Regards
Yafang



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  4:48             ` Yafang Shao
@ 2024-08-05  5:00               ` Huang, Ying
  2024-08-05  5:36                 ` Yafang Shao
  0 siblings, 1 reply; 12+ messages in thread
From: Huang, Ying @ 2024-08-05  5:00 UTC (permalink / raw)
  To: Yafang Shao; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

Yafang Shao <laoar.shao@gmail.com> writes:

> On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Yafang Shao <laoar.shao@gmail.com> writes:
>>
>> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >>
>> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
>> >> >>
>> >> >> [snip]
>> >> >>
>> >> >> >
>> >> >> > Why introduce a systl knob?
>> >> >> > ===========================
>> >> >> >
>> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> > allocation latencies concerning zone->lock contention. Typically, people
>> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
>> >> >> >
>> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> > setting for better throughput.
>> >> >>
>> >> >> Do you have any data to prove that the default setting is better for
>> >> >> throughput?  If so, that will be a strong support for your patch.
>> >> >
>> >> > No, I don't. The primary reason we can't change the default value from
>> >> > 5 to 0 across our fleet of servers is that you initially set it to 5.
>> >> > The sysadmins believe you had a strong reason for setting it to 5 by
>> >> > default; otherwise, it would be considered careless for the upstream
>> >> > kernel. I also believe you must have had a solid justification for
>> >> > setting the default value to 5; otherwise, why would you have
>> >> > submitted your patches?
>> >>
>> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
>> >> avoid too long latency"), I tried my best to run test on the machines
>> >> available with a micro-benchmark (will-it-scale/page_fault1) which
>> >> exercises kernel page allocator heavily.  From the data in commit,
>> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not
>> >> much.  The 99% alloc/free latency can be kept within about 100us with
>> >> CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as default value.
>> >>
>> >> But, we can always improve the default value with more data, on more
>> >> types of machines and with more types of benchmarks, etc.
>> >>
>> >> Your data suggest smaller default value because you have data to show
>> >> that larger default value has the latency spike issue (as large as tens
>> >> ms) for some practical workloads.  Which weren't tested previously.  In
>> >> contrast, we don't have strong data to show the throughput advantages of
>> >> larger CONFIG_PCP_BATCH_SCALE_MAX value.
>> >>
>> >> So, I suggest to use a smaller default value for
>> >> CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more test to check the
>> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best
>> >> choice.
>> >
>> > Which smaller default value would be better?
>>
>> This depends on further test results.
>
> I believe you agree with me that you can't test all workloads.
>
>>
>> > How can we ensure that other workloads, which we haven't tested, will
>> > work well with this new default value?
>>
>> We cannot.  We can only depends on the data available.  If there are
>> new data available in the future, we can make the change accordingly.
>
> So, your solution is to change the hardcoded value for untested
> workloads and then release the kernel package again?
>
>>
>> > If you have a better default value in mind, would you consider sending
>> > a patch for it? I would be happy to test it with my test case.
>>
>> If you can test the value 1, 2, 3, and 4 with your workload, that will
>> be very helpful!  Both allocation latency and total free time (if
>> possible) are valuable.
>
> You know I can't verify it with all workloads, right?
> You have so much data to verify, which indicates uncertainty about any
> default value. Why not make it tunable and let the user choose the
> value they prefer?

We only make decisions based on the data available.  In theory, we cannot
test all workloads, because there will be new workloads in the future.
If we have data showing that a smaller value will cause performance
regressions for some reasonable workloads, we can make it user tunable.

--
Best Regards,
Huang, Ying



* Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
  2024-08-05  5:00               ` Huang, Ying
@ 2024-08-05  5:36                 ` Yafang Shao
  0 siblings, 0 replies; 12+ messages in thread
From: Yafang Shao @ 2024-08-05  5:36 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, mgorman, linux-mm, Matthew Wilcox, David Rientjes

On Mon, Aug 5, 2024 at 1:04 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yafang Shao <laoar.shao@gmail.com> writes:
>
> > On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >>
> >> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >>
> >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes:
> >> >> >>
> >> >> >> [snip]
> >> >> >>
> >> >> >> >
> >> >> >> > Why introduce a systl knob?
> >> >> >> > ===========================
> >> >> >> >
> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >
> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> > setting for better throughput.
> >> >> >>
> >> >> >> Do you have any data to prove that the default setting is better for
> >> >> >> throughput?  If so, that will be a strong support for your patch.
> >> >> >
> >> >> > No, I don't. The primary reason we can't change the default value from
> >> >> > 5 to 0 across our fleet of servers is that you initially set it to 5.
> >> >> > The sysadmins believe you had a strong reason for setting it to 5 by
> >> >> > default; otherwise, it would be considered careless for the upstream
> >> >> > kernel. I also believe you must have had a solid justification for
> >> >> > setting the default value to 5; otherwise, why would you have
> >> >> > submitted your patches?
> >> >>
> >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
> >> >> avoid too long latency"), I tried my best to run test on the machines
> >> >> available with a micro-benchmark (will-it-scale/page_fault1) which
> >> >> exercises kernel page allocator heavily.  From the data in commit,
> >> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not
> >> >> much.  The 99% alloc/free latency can be kept within about 100us with
> >> >> CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as default value.
> >> >>
> >> >> But, we can always improve the default value with more data, on more
> >> >> types of machines and with more types of benchmarks, etc.
> >> >>
> >> >> Your data suggest smaller default value because you have data to show
> >> >> that larger default value has the latency spike issue (as large as tens
> >> >> ms) for some practical workloads.  Which weren't tested previously.  In
> >> >> contrast, we don't have strong data to show the throughput advantages of
> >> >> larger CONFIG_PCP_BATCH_SCALE_MAX value.
> >> >>
> >> >> So, I suggest to use a smaller default value for
> >> >> CONFIG_PCP_BATCH_SCALE_MAX.  But, we may need more test to check the
> >> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best
> >> >> choice.
> >> >
> >> > Which smaller default value would be better?
> >>
> >> This depends on further test results.
> >
> > I believe you agree with me that you can't test all workloads.
> >
> >>
> >> > How can we ensure that other workloads, which we haven't tested, will
> >> > work well with this new default value?
> >>
> >> We cannot.  We can only depends on the data available.  If there are
> >> new data available in the future, we can make the change accordingly.
> >
> > So, your solution is to change the hardcoded value for untested
> > workloads and then release the kernel package again?
> >
> >>
> >> > If you have a better default value in mind, would you consider sending
> >> > a patch for it? I would be happy to test it with my test case.
> >>
> >> If you can test the value 1, 2, 3, and 4 with your workload, that will
> >> be very helpful!  Both allocation latency and total free time (if
> >> possible) are valuable.
> >
> > You know I can't verify it with all workloads, right?
> > You have so much data to verify, which indicates uncertainty about any
> > default value. Why not make it tunable and let the user choose the
> > value they prefer?
>
> We only make decision based on data available.  In theory, we cannot
> test all workloads, because there will be new workloads in the future.
> If we have data to show that smaller value will cause performance
> regressions for some reasonable workloads, we can make it user tunable.

The issue arises when a new workload is discovered; you have to
release a new kernel package for it. If that's your expectation, why
not make it tunable from the start? Had you made it tunable in your
original commit, we wouldn't be having this non-intuitive discussion
repeatedly.

Which came first, the chicken or the egg?

-- 
Regards
Yafang


