From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org,  mgorman@techsingularity.net,
	linux-mm@kvack.org,  Matthew Wilcox <willy@infradead.org>,
	 David Rientjes <rientjes@google.com>
Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Wed, 10 Jul 2024 10:49:21 +0800
Message-ID: <878qyaarm6.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20240707094956.94654-4-laoar.shao@gmail.com> (Yafang Shao's message of "Sun, 7 Jul 2024 17:49:56 +0800")

Yafang Shao <laoar.shao@gmail.com> writes:

> The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
> quickly experimenting with specific workloads in a production environment,
> particularly when monitoring latency spikes caused by contention on the
> zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
> is introduced as a more practical alternative.

In general, I'm neutral on the change.  I can understand that a kernel
configuration option isn't as flexible as a sysctl knob.  But a sysctl
knob is ABI too.

> To ultimately mitigate the zone->lock contention issue, several suggestions
> have been proposed. One approach involves dividing large zones into multiple
> smaller zones, as suggested by Matthew[0], while another entails splitting
> the zone->lock using a mechanism similar to memory arenas and shifting away
> from relying solely on zone_id to identify the range of free lists a
> particular page belongs to[1]. However, implementing these solutions is
> likely to necessitate a more extended development effort.

Per my understanding, the change will hurt zone->lock contention rather
than improve it.  What it will reduce is page allocation/freeing latency.
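
To make that concrete, here is a rough userspace model of the trade-off
(not the kernel code; the batch size of 63 and the PCP count of 10000
below are assumed purely for illustration, the real batch is computed by
the kernel per zone).  A smaller scale factor caps how many pages are
handled per zone->lock hold, which bounds the worst-case latency of a
single operation, but draining the same number of pages then takes more
lock acquisitions:

#include <stdio.h>

int main(void)
{
	const int batch = 63;		/* assumed per-zone batch, for illustration */
	const int pcp_count = 10000;	/* assumed pages sitting in one PCP list */

	for (int scale = 0; scale <= 6; scale++) {
		/* At most batch << scale pages are moved per lock hold. */
		int chunk = batch << scale;
		/* zone->lock round trips needed to drain the whole list. */
		int acquisitions = (pcp_count + chunk - 1) / chunk;

		printf("scale=%d: <=%4d pages per lock hold, %3d lock acquisitions\n",
		       scale, chunk, acquisitions);
	}
	return 0;
}

With the default scale of 5 and a batch of 63, that is up to
63 << 5 = 2016 pages moved under one lock hold, which matches the
(batch << N) bound described in the documentation hunk below.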

> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0]
> Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1]
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++
>  include/linux/sysctl.h                  |  1 +
>  kernel/sysctl.c                         |  2 +-
>  mm/Kconfig                              | 11 -----------
>  mm/page_alloc.c                         | 22 ++++++++++++++++------
>  5 files changed, 33 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index e86c968a7a0e..eb9e5216eefe 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm:
>  - page_lock_unfairness
>  - panic_on_oom
>  - percpu_pagelist_high_fraction
> +- pcp_batch_scale_max
>  - stat_interval
>  - stat_refresh
>  - numa_stat
> @@ -864,6 +865,20 @@ mark based on the low watermark for the zone and the number of local
>  online CPUs.  If the user writes '0' to this sysctl, it will revert to
>  this default behavior.
>  
> +pcp_batch_scale_max
> +===================
> +
> +In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> +batches.  The batch number is scaled automatically to improve page
> +allocation/free throughput.  But too large scale factor may hurt
> +latency.  This option sets the upper limit of scale factor to limit
> +the maximum latency.
> +
> +The range for this parameter spans from 0 to 6, with a default value of 5.
> +The value assigned to 'N' signifies that during each refilling or draining
> +process, a maximum of (batch << N) pages will be involved, where "batch"
> +represents the default batch size automatically computed by the kernel for
> +each zone.
>  
>  stat_interval
>  =============
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 09db2f2e6488..fb797f1c0ef7 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -52,6 +52,7 @@ struct ctl_dir;
>  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
>  #define SYSCTL_MAXOLDUID		((void *)&sysctl_vals[10])
>  #define SYSCTL_NEG_ONE			((void *)&sysctl_vals[11])
> +#define SYSCTL_SIX			((void *)&sysctl_vals[12])
>  
>  extern const int sysctl_vals[];
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e0b917328cf9..430ac4f58eb7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -82,7 +82,7 @@
>  #endif
>  
>  /* shared constants to be used in various sysctls */
> -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 };
> +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 6 };
>  EXPORT_SYMBOL(sysctl_vals);
>  
>  const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX };
> diff --git a/mm/Kconfig b/mm/Kconfig
> index b4cb45255a54..41fe4c13b7ac 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
>  config CONTIG_ALLOC
>  	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
>  
> -config PCP_BATCH_SCALE_MAX
> -	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
> -	default 5
> -	range 0 6
> -	help
> -	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
> -	  batches.  The batch number is scaled automatically to improve page
> -	  allocation/free throughput.  But too large scale factor may hurt
> -	  latency.  This option sets the upper limit of scale factor to limit
> -	  the maximum latency.
> -
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2b76754a48e0..703eec22a997 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -273,6 +273,7 @@ int min_free_kbytes = 1024;
>  int user_min_free_kbytes = -1;
>  static int watermark_boost_factor __read_mostly = 15000;
>  static int watermark_scale_factor = 10;
> +static int pcp_batch_scale_max = 5;
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -2310,7 +2311,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
>  	int count = READ_ONCE(pcp->count);
>  
>  	while (count) {
> -		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
>  		count -= to_drain;
>  
>  		spin_lock(&pcp->lock);
> @@ -2438,7 +2439,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
>  
>  	/* Free as much as possible if batch freeing high-order pages. */
>  	if (unlikely(free_high))
> -		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +		return min(pcp->count, batch << pcp_batch_scale_max);
>  
>  	/* Check for PCP disabled or boot pageset */
>  	if (unlikely(high < batch))
> @@ -2470,7 +2471,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>  		return 0;
>  
>  	if (unlikely(free_high)) {
> -		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
> +		pcp->high = max(high - (batch << pcp_batch_scale_max),
>  				high_min);
>  		return 0;
>  	}
> @@ -2540,9 +2541,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>  		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>  	}
> -	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +	if (pcp->free_count < (batch << pcp_batch_scale_max))
>  		pcp->free_count = min(pcp->free_count + (1 << order),
> -				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
> +				      batch << pcp_batch_scale_max);
>  	high = nr_pcp_high(pcp, zone, batch, free_high);
>  	if (pcp->count >= high) {
>  		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> @@ -2884,7 +2885,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>  		 * subsequent allocation of order-0 pages without any freeing.
>  		 */
>  		if (batch <= max_nr_alloc &&
> -		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
> +		    pcp->alloc_factor < pcp_batch_scale_max)
>  			pcp->alloc_factor++;
>  		batch = min(batch, max_nr_alloc);
>  	}
> @@ -6251,6 +6252,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
>  		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
>  		.extra1		= SYSCTL_ZERO,
>  	},
> +	{
> +		.procname	= "pcp_batch_scale_max",
> +		.data		= &pcp_batch_scale_max,
> +		.maxlen		= sizeof(pcp_batch_scale_max),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_SIX,
> +	},
>  	{
>  		.procname	= "lowmem_reserve_ratio",
>  		.data		= &sysctl_lowmem_reserve_ratio,
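
One more note on usability: because the knob lands under /proc/sys/vm/,
it can be changed at run time without a kernel rebuild, which is the
whole point of replacing the Kconfig option.  A minimal sketch of doing
that from C (assuming the patch is applied, so
/proc/sys/vm/pcp_batch_scale_max exists; "sysctl -w
vm.pcp_batch_scale_max=N" does the same thing):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/vm/pcp_batch_scale_max";
	FILE *f;
	int val;

	if (argc > 1) {			/* write a new scale factor, 0..6 */
		f = fopen(path, "w");
		if (!f) { perror(path); return 1; }
		fprintf(f, "%d\n", atoi(argv[1]));
		fclose(f);
	}

	f = fopen(path, "r");		/* read back the current value */
	if (!f) { perror(path); return 1; }
	if (fscanf(f, "%d", &val) == 1)
		printf("vm.pcp_batch_scale_max = %d\n", val);
	fclose(f);
	return 0;
}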

--
Best Regards,
Huang, Ying

