linux-mm.kvack.org archive mirror
From: Yafang Shao <laoar.shao@gmail.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH] mm: Allow writing -1 to vm.percpu_pagelist_high_fraction to select the minimum pagelist
Date: Mon,  1 Jul 2024 22:20:46 +0800	[thread overview]
Message-ID: <20240701142046.6050-1-laoar.shao@gmail.com> (raw)

We are encountering latency spikes in our container environment when a
container running multiple Python-based tasks exits. While exiting, these
tasks can hold zone->lock for an extended period, significantly increasing
allocation latency for other containers.

As a workaround, we've found that minimizing the pagelist size, for
example capping it at four times the batch size, helps mitigate these
spikes. However, tuning vm.percpu_pagelist_high_fraction across a large
fleet of servers is difficult because CPU counts, NUMA topology, and
physical memory capacity vary from machine to machine.

To make this practical, we propose allowing -1 to be written to
vm.percpu_pagelist_high_fraction to select the minimum pagelist size.
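
With the patch applied, selecting the minimum could look like the
following sketch. The sysctl name is real; the batch value of 63 is only
a common default used here for illustration, not a guaranteed constant:

```shell
# Select the minimum pagelist size (requires root and this patch):
#   sysctl -w vm.percpu_pagelist_high_fraction=-1
# The minimum installed by the patch is four times the batch size:
batch=63
echo $((batch << 2))   # resulting per-CPU high limit, in pages
```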

Furthermore, given how unintuitive vm.percpu_pagelist_high_fraction is to
use, it would be beneficial to introduce a more direct parameter,
vm.percpu_pagelist_high_size, permitting the pagelist size to be
specified as a multiple of the batch size. This would mirror the split
between vm.dirty_ratio and vm.dirty_bytes, giving users greater
flexibility and control.

We have discussed introducing multiple small zones to mitigate
zone->lock contention [0], but that approach is likely to require a
longer-term implementation effort.

Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/admin-guide/sysctl/vm.rst | 4 ++++
 mm/page_alloc.c                         | 8 ++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..1f535d022cda 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
 page lists. A user can specify a number like 100 to allocate 1/100th of
 each zone between per-cpu lists.
 
+The minimum number of pages that can be stored in a per-CPU page list
+is four times the batch value. Writing '-1' to this sysctl selects
+this minimum size.
+
 The batch value of each per-cpu page list remains the same regardless of
 the value of the high fraction so allocation latencies are unaffected.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5675ca..e7313f9d704b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
 	int nr_split_cpus;
 	unsigned long total_pages;
 
+	/* A high_fraction of -1 selects the minimum pagelist size: batch * 4 */
+	if (high_fraction == -1)
+		return batch << 2;
+
 	if (!high_fraction) {
 		/*
 		 * By default, the high value of the pcp is based on the zone
@@ -6192,7 +6196,8 @@ static int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
 
 	/* Sanity checking to avoid pcp imbalance */
 	if (percpu_pagelist_high_fraction &&
-	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
+	    percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION &&
+	    percpu_pagelist_high_fraction != -1) {
 		percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction;
 		ret = -EINVAL;
 		goto out;
@@ -6241,7 +6246,6 @@ static struct ctl_table page_alloc_sysctl_table[] = {
 		.maxlen		= sizeof(percpu_pagelist_high_fraction),
 		.mode		= 0644,
 		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
-		.extra1		= SYSCTL_ZERO,
 	},
 	{
 		.procname	= "lowmem_reserve_ratio",
-- 
2.43.5




Thread overview: 16+ messages
2024-07-01 14:20 Yafang Shao [this message]
2024-07-02  2:51 ` Andrew Morton
2024-07-02  6:37   ` Yafang Shao
2024-07-02  9:08     ` Huang, Ying
2024-07-02 12:07       ` Yafang Shao
2024-07-03  1:55         ` Huang, Ying
2024-07-03  2:13           ` Yafang Shao
2024-07-03  3:21             ` Huang, Ying
2024-07-03  3:44               ` Yafang Shao
2024-07-03  5:34                 ` Huang, Ying
2024-07-04 13:27                   ` Yafang Shao
2024-07-05  1:28                     ` Huang, Ying
2024-07-05  3:03                       ` Yafang Shao
2024-07-05  5:31                         ` Huang, Ying
2024-07-05 13:09   ` Mel Gorman
2024-07-02  7:23 ` Huang, Ying
