From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 84AC9C2BD09 for ; Wed, 10 Jul 2024 02:51:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F10EB6B0089; Tue, 9 Jul 2024 22:51:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EC0126B008A; Tue, 9 Jul 2024 22:51:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DAE826B008C; Tue, 9 Jul 2024 22:51:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id BB8B36B0089 for ; Tue, 9 Jul 2024 22:51:19 -0400 (EDT) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 327D941AD7 for ; Wed, 10 Jul 2024 02:51:19 +0000 (UTC) X-FDA: 82322316678.19.B084F36 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) by imf28.hostedemail.com (Postfix) with ESMTP id B8C4CC0006 for ; Wed, 10 Jul 2024 02:51:15 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=XecjXLc9; spf=pass (imf28.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.12 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1720579846; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=qRFfQYhvwvuTa1wCH2p+TdbZv4L5eemy6Wk5dxaAp6o=; b=yYhpYuxW35eIdDuT8acWcIP+LFoHvxPemtccs6wnbdkVJGf8moyokUdn65Jb5lVqhUb5fU M6BFJfSxo6A6hBQkSMQcNj6KaKPC/NUOik3nNFQnxLcXKBpExXItk3cLCjZ8eRzZSBYSqB dBbmgW0kPRZWIWWE3j7L6E2lUZ8OAqg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1720579846; a=rsa-sha256; cv=none; b=KhirBT4IOh/oRiLoimubicYMoxO+1x1WXtuOnHvmXxV6MD6EXPHcaStSzrZZlaTtWCBqZm pwugPo1FnzoGMn9Z+tx6ia5D7SIXJHNdb0h+Eyjaq9NHcmJGT8v1LhvlOV3nsmrXU/QOTJ b2zQ8huEVua0dvTLmV+OBeQoRmkVcSo= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=XecjXLc9; spf=pass (imf28.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.12 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1720579876; x=1752115876; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version; bh=MREfsWkC83lq1jFKawSa5Rp8EDMFMelL5PdjmCVxU7w=; b=XecjXLc9K8DdY7FsByxAKrtARVobS9rlUFv2yqAeYuiJmWMZs3uL7zxI BAG5fVXAL4fikKugy2iIviaFWDv/za+FT17fG82oUsxal4p4gT2CX+oXM 307U426VChE6t7DGWs2socdQYTkKmRXIlSypMOf1ioZ/IGmjyG9zi1+2o suk5kx0dFkUsdcbmCBvAE1uG+1RKCsUFx0pNOXIRdV9eLY/0Dgp3d54IN ftimTnPBvfO00Oo8rFsrOWy/aNMcX/Ivj8Avq3e1UKwOxgWYFdPHI0Bwn SGEdB7L/f5xPeDaWNYwyMbTCv/XFxQp2g8N26vRw1p37uzsyk7+R7As/G w==; X-CSE-ConnectionGUID: R20ngpOhRmysVem74c1O/g== X-CSE-MsgGUID: R/hE6jOSQ8ijj7bmQiyLNw== X-IronPort-AV: E=McAfee;i="6700,10204,11128"; a="29274572" X-IronPort-AV: E=Sophos;i="6.09,196,1716274800"; d="scan'208";a="29274572" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jul 2024 19:51:14 -0700 X-CSE-ConnectionGUID: RjA+z8MbQH2ac2cCivXexQ== X-CSE-MsgGUID: l4X6MBOURKGUOiAz3yvo4Q== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.09,196,1716274800"; d="scan'208";a="48792661" Received: from unknown (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orviesa008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Jul 2024 19:51:13 -0700 From: "Huang, Ying" To: Yafang Shao Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org, Matthew Wilcox , David Rientjes Subject: Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max In-Reply-To: <20240707094956.94654-4-laoar.shao@gmail.com> (Yafang Shao's message of "Sun, 7 Jul 2024 17:49:56 +0800") References: <20240707094956.94654-1-laoar.shao@gmail.com> <20240707094956.94654-4-laoar.shao@gmail.com> Date: Wed, 10 Jul 2024 10:49:21 +0800 Message-ID: <878qyaarm6.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Rspamd-Queue-Id: B8C4CC0006 X-Stat-Signature: k63uzt9pfpwehh7nixqochr85aihpeoh X-Rspamd-Server: rspam09 X-Rspam-User: X-HE-Tag: 1720579875-935142 X-HE-Meta: U2FsdGVkX1/9QkFG8cQK2J1ONJ6IX8BQOxHnVnK0o9QmtC/dYzgYYzE3AdDMXUNDu7JI7KkJpRb7SJxX7LI/MDI0PwHGyq0IPxkFvlDqxrqyNLIorFictWtNxLplA9hZX3HKlC7zIt0pNBbobnz3xZ84PzM5EWUK4HINF0xNXlToYAkoowg31TbG9l50pRFes3eXT4O4k1pBj5bvSAYQ+SUc0wTNxRKZUZ5gn57hoYwM9Xz+sgyNmaGo8iLDiuzFvkLUkvkC2d8M68egK8dZ58njulLOY4qyuD7nx2VnAktvhc84entw8RjVHc7Xsh1Y0KbojkTqZT6oY2ROhreblQs20ek1Xoz9gEkwC0CfK9tNlmdf+n6sXPE6YpMmoUNCEXm4rbIppBZBrNOR97ZCDy9ISNHBFGnT+Ypc8KWT3PBGX+Q6CC10plpSDjfAU9yWBFckOgw3CTGKxSjhkdTPoqpAG4zdtPJcn432QYj3F2tjdRqN2ESqYT0lZjM2foGZyd2wofgLp00w5RWyRB+/1LnWZkG/MsVqAZ+lydsjAdR18+7a5IjmGK3vl1JHLStW6DF7g9eYlWqi95KqlhTaRSHx/Kt7vCTmB5N/fQmIGII/KF58mFgiH1tzU2+82U8vrwL4uKQkLVcHoTrVxKs+mBO++hYGmqS0F/a5uHCGKQaqOE4uW46mK2SHw3WTHa+g8sJ3BDitHXmfI+grpSG+z0eO2QfoYG/Zf8hZ6Vpm4vfz2E8KPg0Y1Czuneys2vBoxr+RumrZqwpgG2NwIwZpC2SDNT6slYQ7iBkG8n4K3+7mfz6AeKrKdvthOOt71odghy2jcVRGL0zqhOU1paESbag8iFmsnT+w/whhMTwGTyOy0dNuKnTs5UKJAtx53owqkFNnImYfVMBuMcFs3FAlLbNoHK1oDT9vZoNyhBl43udNdib4wCWIu45TPzELjmMvmHMDXlzASaM//tLC7NR h99Dvw/d 5H85bRiKLKQRgFmlHsSAr3RGTPg28Y66vs84q0ShGIcwO48Z5Vbs0XByan6qODOc4j1tDRVTZLxGtzfVYep8NW1ieNyk148qs0r0h5pDD7siX9syiScUCEh9TgIgf/x+Pujzawg6JGwPZFvKZb1WlwKqvedafu5LlQr0rB9SgniTNGj0X+0u2t6K6031ec7Pjg2srKPzq7StHQyO60UiL/CL4GH5BofBhneUpSrOyc+Wa26SNQK0eMYKWzgXpUUDIPVR3gzJSJi7N45ruEvPyMZUzfFBPYz1Gg2bdT4LT7qPOGjUKdDA7wDHyVEdw3wXL3BKQ2rn4zrhAq5WR+jK1vYLmuhjhSzQnJLjHzL2TZJEXuF39wdl8elbanFsrIEvpaRhJ6vFkqmDYqewSmhoiejCNaeqpUCyufv6A2lpkPwUDtyd4oXGzT0820WZ9ufH04iXjsVaTwTCTntBbYk+ZGKQx5v0x0aHWUhHXoiuhi58IIvKN7NP6rRRUWA+QgM1QcLdb X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Yafang Shao writes: > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for > quickly experimenting with specific workloads in a production environment, > particularly when monitoring latency spikes caused by contention on the > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max > is introduced as a more practical alternative. In general, I'm neutral to the change. I can understand that kernel configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI too. > To ultimately mitigate the zone->lock contention issue, several suggestions > have been proposed. One approach involves dividing large zones into multi > smaller zones, as suggested by Matthew[0], while another entails splitting > the zone->lock using a mechanism similar to memory arenas and shifting away > from relying solely on zone_id to identify the range of free lists a > particular page belongs to[1]. However, implementing these solutions is > likely to necessitate a more extended development effort. Per my understanding, the change will hurt instead of improve zone->lock contention. Instead, it will reduce page allocation/freeing latency. > Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [0] > Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [1] > Signed-off-by: Yafang Shao > Cc: "Huang, Ying" > Cc: Mel Gorman > Cc: Matthew Wilcox > Cc: David Rientjes > --- > Documentation/admin-guide/sysctl/vm.rst | 15 +++++++++++++++ > include/linux/sysctl.h | 1 + > kernel/sysctl.c | 2 +- > mm/Kconfig | 11 ----------- > mm/page_alloc.c | 22 ++++++++++++++++------ > 5 files changed, 33 insertions(+), 18 deletions(-) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index e86c968a7a0e..eb9e5216eefe 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm: > - page_lock_unfairness > - panic_on_oom > - percpu_pagelist_high_fraction > +- pcp_batch_scale_max > - stat_interval > - stat_refresh > - numa_stat > @@ -864,6 +865,20 @@ mark based on the low watermark for the zone and the number of local > online CPUs. If the user writes '0' to this sysctl, it will revert to > this default behavior. > > +pcp_batch_scale_max > +=================== > + > +In page allocator, PCP (Per-CPU pageset) is refilled and drained in > +batches. The batch number is scaled automatically to improve page > +allocation/free throughput. But too large scale factor may hurt > +latency. This option sets the upper limit of scale factor to limit > +the maximum latency. > + > +The range for this parameter spans from 0 to 6, with a default value of 5. > +The value assigned to 'N' signifies that during each refilling or draining > +process, a maximum of (batch << N) pages will be involved, where "batch" > +represents the default batch size automatically computed by the kernel for > +each zone. > > stat_interval > ============= > diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h > index 09db2f2e6488..fb797f1c0ef7 100644 > --- a/include/linux/sysctl.h > +++ b/include/linux/sysctl.h > @@ -52,6 +52,7 @@ struct ctl_dir; > /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ > #define SYSCTL_MAXOLDUID ((void *)&sysctl_vals[10]) > #define SYSCTL_NEG_ONE ((void *)&sysctl_vals[11]) > +#define SYSCTL_SIX ((void *)&sysctl_vals[12]) > > extern const int sysctl_vals[]; > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index e0b917328cf9..430ac4f58eb7 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -82,7 +82,7 @@ > #endif > > /* shared constants to be used in various sysctls */ > -const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1 }; > +const int sysctl_vals[] = { 0, 1, 2, 3, 4, 100, 200, 1000, 3000, INT_MAX, 65535, -1, 6 }; > EXPORT_SYMBOL(sysctl_vals); > > const unsigned long sysctl_long_vals[] = { 0, 1, LONG_MAX }; > diff --git a/mm/Kconfig b/mm/Kconfig > index b4cb45255a54..41fe4c13b7ac 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE > config CONTIG_ALLOC > def_bool (MEMORY_ISOLATION && COMPACTION) || CMA > > -config PCP_BATCH_SCALE_MAX > - int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free" > - default 5 > - range 0 6 > - help > - In page allocator, PCP (Per-CPU pageset) is refilled and drained in > - batches. The batch number is scaled automatically to improve page > - allocation/free throughput. But too large scale factor may hurt > - latency. This option sets the upper limit of scale factor to limit > - the maximum latency. > - > config PHYS_ADDR_T_64BIT > def_bool 64BIT > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2b76754a48e0..703eec22a997 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -273,6 +273,7 @@ int min_free_kbytes = 1024; > int user_min_free_kbytes = -1; > static int watermark_boost_factor __read_mostly = 15000; > static int watermark_scale_factor = 10; > +static int pcp_batch_scale_max = 5; > > /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ > int movable_zone; > @@ -2310,7 +2311,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) > int count = READ_ONCE(pcp->count); > > while (count) { > - int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); > + int to_drain = min(count, pcp->batch << pcp_batch_scale_max); > count -= to_drain; > > spin_lock(&pcp->lock); > @@ -2438,7 +2439,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free > > /* Free as much as possible if batch freeing high-order pages. */ > if (unlikely(free_high)) > - return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX); > + return min(pcp->count, batch << pcp_batch_scale_max); > > /* Check for PCP disabled or boot pageset */ > if (unlikely(high < batch)) > @@ -2470,7 +2471,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, > return 0; > > if (unlikely(free_high)) { > - pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX), > + pcp->high = max(high - (batch << pcp_batch_scale_max), > high_min); > return 0; > } > @@ -2540,9 +2541,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, > } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { > pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; > } > - if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) > + if (pcp->free_count < (batch << pcp_batch_scale_max)) > pcp->free_count = min(pcp->free_count + (1 << order), > - batch << CONFIG_PCP_BATCH_SCALE_MAX); > + batch << pcp_batch_scale_max); > high = nr_pcp_high(pcp, zone, batch, free_high); > if (pcp->count >= high) { > free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), > @@ -2884,7 +2885,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) > * subsequent allocation of order-0 pages without any freeing. > */ > if (batch <= max_nr_alloc && > - pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX) > + pcp->alloc_factor < pcp_batch_scale_max) > pcp->alloc_factor++; > batch = min(batch, max_nr_alloc); > } > @@ -6251,6 +6252,15 @@ static struct ctl_table page_alloc_sysctl_table[] = { > .proc_handler = percpu_pagelist_high_fraction_sysctl_handler, > .extra1 = SYSCTL_ZERO, > }, > + { > + .procname = "pcp_batch_scale_max", > + .data = &pcp_batch_scale_max, > + .maxlen = sizeof(pcp_batch_scale_max), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_SIX, > + }, > { > .procname = "lowmem_reserve_ratio", > .data = &sysctl_lowmem_reserve_ratio, -- Best Regards, Huang, Ying