From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Wed, 3 Jul 2024 11:44:08 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com>
	<20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
	<874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 03 Jul 2024 13:34:35 +0800
Message-ID: <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Yafang Shao writes:

> On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>> >> >> >>
>> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>> >> >> >>
>> >> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >> >>
>> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> >> reasonably detailed description of the issue and a description of any
>> >> >> >> ongoing work would be helpful here.
>> >> >> >
>> >> >> > In our containerized environment, we have a specific type of container
>> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> > processes are organized as separate processes rather than threads due
>> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> > containers hosted on the same machine experience significant latency
>> >> >> > spikes.
>> >> >> >
>> >> >> > Our investigation using perf tracing revealed that the root cause of
>> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> > primary contributor to the observed latency issues.
>> >> >> >
>> >> >> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >> >      - 76.97% exit_mmap
>> >> >> >         - 58.58% unmap_vmas
>> >> >> >            - 58.55% unmap_single_vma
>> >> >> >               - unmap_page_range
>> >> >> >                  - 58.32% zap_pte_range
>> >> >> >                     - 42.88% tlb_flush_mmu
>> >> >> >                        - 42.76% free_pages_and_swap_cache
>> >> >> >                           - 41.22% release_pages
>> >> >> >                              - 33.29% free_unref_page_list
>> >> >> >                                 - 32.37% free_unref_page_commit
>> >> >> >                                    - 31.64% free_pcppages_bulk
>> >> >> >                                       + 28.65% _raw_spin_lock
>> >> >> >                                         1.28% __list_del_entry_valid
>> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
>> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
>> >> >> >                                0.60% __mod_lruvec_state
>> >> >> >                             1.07% free_swap_cache
>> >> >> >                  + 11.69% page_remove_rmap
>> >> >> >                    0.64% __mod_lruvec_page_state
>> >> >> >      - 17.34% remove_vma
>> >> >> >         - 17.25% vm_area_free
>> >> >> >            - 17.23% kmem_cache_free
>> >> >> >               - 17.15% __slab_free
>> >> >> >                  - 14.56% discard_slab
>> >> >> >                       free_slab
>> >> >> >                          __free_slab
>> >> >> >                             __free_pages
>> >> >> >                                - free_unref_page
>> >> >> >                                   - 13.50% free_unref_page_commit
>> >> >> >                                      - free_pcppages_bulk
>> >> >> >                                         + 13.44% _raw_spin_lock
>> >> >> >
>> >> >> > By enabling the mm_page_pcpu_drain() we can find the detailed stack:
>> >> >> >
>> >> >> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>> >> >> >       page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >> >   <...>-1540432 [224] d..3. 618048.023887:
>> >> >> >  => free_pcppages_bulk
>> >> >> >  => free_unref_page_commit
>> >> >> >  => free_unref_page_list
>> >> >> >  => release_pages
>> >> >> >  => free_pages_and_swap_cache
>> >> >> >  => tlb_flush_mmu
>> >> >> >  => zap_pte_range
>> >> >> >  => unmap_page_range
>> >> >> >  => unmap_single_vma
>> >> >> >  => unmap_vmas
>> >> >> >  => exit_mmap
>> >> >> >  => mmput
>> >> >> >  => do_exit
>> >> >> >  => do_group_exit
>> >> >> >  => get_signal
>> >> >> >  => arch_do_signal_or_restart
>> >> >> >  => exit_to_user_mode_prepare
>> >> >> >  => syscall_exit_to_user_mode
>> >> >> >  => do_syscall_64
>> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >
>> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >
>> >> >> > Node 0, zone   Normal
>> >> >> >   pages free     144465775
>> >> >> >         boost    0
>> >> >> >         min      1309270
>> >> >> >         low      1636587
>> >> >> >         high     1963904
>> >> >> >         spanned  564133888
>> >> >> >         present  296747008
>> >> >> >         managed  291974346
>> >> >> >         cma      0
>> >> >> >         protection: (0, 0, 0, 0)
>> >> >> > ...
>> >> >> > ...
>> >> >> >   pagesets
>> >> >> >     cpu: 0
>> >> >> >               count: 2217
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 1
>> >> >> >               count: 4510
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 2
>> >> >> >               count: 3059
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >
>> >> >> > ...
>> >> >> >
>> >> >> > The high is around 100 times the batch size.
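As a cross-check, the "high: 6392" in the zoneinfo above follows directly from the reported numbers. The Python below is a simplified sketch of the pcp-high sizing logic (modeled on zone_highsize() in mm/page_alloc.c of recent kernels, not the kernel code itself); the 4*batch floor is the minimum that the patch under discussion wants to reach by setting the sysctl to -1.

```python
def pcp_high(low_wmark, managed, nr_cpus, batch, high_fraction=0):
    """Sketch of how the kernel sizes the per-CPU pagelist 'high' limit.

    By default the zone's low watermark worth of pages is spread across
    the CPUs; a nonzero vm.percpu_pagelist_high_fraction divides the
    zone's managed pages instead.
    """
    if high_fraction:
        total = managed // high_fraction
    else:
        total = low_wmark
    # high never drops below 4x the batch size.
    return max(total // nr_cpus, batch * 4)

# Numbers from the zoneinfo above: low 1636587, managed 291974346,
# 256 CPUs, batch 63.
print(pcp_high(1636587, 291974346, 256, 63))         # -> 6392
# A very large fraction collapses high to the 4*batch floor:
print(pcp_high(1636587, 291974346, 256, 63, 10**9))  # -> 252
```

So 1636587 low-watermark pages spread over 256 CPUs gives the observed high of 6392, roughly 100x the batch of 63, while the floor is 252.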
>> >> >> >
>> >> >> > We also traced the latency associated with the free_pcppages_bulk()
>> >> >> > function during the container exit process:
>> >> >> >
>> >> >> > 19:48:54
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >
>> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >
>> >> >> > The latency can reach tens of milliseconds.
>> >> >> >
>> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> >> > significantly reduce the latency associated with the
>> >> >> > free_pcppages_bulk() function during container exits:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >
>> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >
>> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> >> > knob to set the minimum pagelist high at a level that effectively
>> >> >> > mitigated latency issues, we observed that other containers were no
>> >> >> > longer experiencing similar complaints. As a result, we decided to
>> >> >> > implement this tuning as a permanent workaround and have deployed it
>> >> >> > across all clusters of servers where these containers may be deployed.
>> >> >>
>> >> >> Thanks for your detailed data.
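The two histograms above are worth comparing with a little arithmetic on the reported totals (all numbers taken directly from the traces):

```python
# Before tuning: 1920 large bulk frees; after: 55925 small ones.
before_total, before_count = 5887317501, 1920
after_total, after_count = 5805802787, 55925

avg_before = before_total // before_count  # 3066311 ns, as reported
avg_after = after_total // after_count     # 103814 ns, as reported

print(avg_before, avg_after)
print(avg_before // avg_after)  # roughly 30x lower per-call latency

# Total time spent in free_pcppages_bulk() is nearly unchanged
# (~5.9 s vs ~5.8 s), so the tuning shortens each zone->lock hold
# rather than reducing the total freeing work.
```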
>> >> >>
>> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> shouldn't be a problem?
>> >> >
>> >> > Right. The problem arises when the process holds the lock for too
>> >> > long, causing other processes that are attempting to allocate memory
>> >> > to experience delays or wait times.
>> >> >
>> >> >> Because users care more about the total time of
>> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> your configuration?
>> >> >
>> >> > While reducing throughput may not be a significant concern due to the
>> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> > for assessing system stability, is of greater concern to users. Higher
>> >> > latency can lead to request errors, impacting the user experience.
>> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> > throughput, is preferable to experiencing higher throughput with
>> >> > unstable performance.
>> >> >
>> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> processes is a problem.  And your configuration can help it.
>> >> >>
>> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> >> may help both latency and throughput in your system.  Could you give it
>> >> >> a try?
>> >> >
>> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> > configuration option. However, I've observed your recent improvements
>> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> > has prompted me to experiment with manually setting the
>> >> > pcp->free_factor to zero.
>> >> > While this adjustment provided some
>> >> > improvement, the results were not as significant as I had hoped.
>> >> >
>> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> > to more easily adjust it.
>> >>
>> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> upstream.  Could you find a way to do that?
>> >
>> > I'm afraid I can't run an upstream kernel in our production environment :(
>> > Lots of code changes have to be made.
>>
>> Understand.  Can you find a way to test upstream behavior, not the upstream
>> kernel exactly?  Or test the upstream kernel in a similar but not
>> exactly production environment.
>
> I'm willing to give it a try, but it may take some time to achieve the
> desired results.

Thanks!

>> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
>> >
>> > It seems incorrect.
>> > Look at the code in free_unref_page_commit():
>> >
>> >     if (pcp->count >= high) {
>> >         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>> >                            pcp, pindex);
>> >     }
>> >
>> > And nr_pcp_free():
>> >
>> >     min_nr_free = batch;
>> >     max_nr_free = high - batch;
>> >
>> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
>> >     return batch;
>> >
>> > The 'batch' is not a fixed value but changes dynamically, doesn't it?
>>
>> Sorry, my words were confusing.  For 'batch', I mean the value of the
>> "count" parameter of free_pcppages_bulk() actually.  For example, if we
>> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.
>
> If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
> that the pcp->free_count should not exceed (63 << 0), right? (suppose
> 63 is the default batch size)
> However, at worst, the pcp->free_count can be (62 + (1 << MAX_ORDER)),
> is that expected?
>
> Perhaps we should make the change below?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e7313f9d704b..8c52a30201d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>         } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>                 pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>         }
> -       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
>                 pcp->free_count += (1 << order);
> +               if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
> +                       pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
> +       }
>         high = nr_pcp_high(pcp, zone, batch, free_high);
>         if (pcp->count >= high) {
>                 free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),

Or

	pcp->free_count = min(pcp->free_count + (1 << order),
			      batch << CONFIG_PCP_BATCH_SCALE_MAX);

--
Best Regards,
Huang, Ying
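To make the discussion above concrete, here is a small Python model of the two quantities involved: the bulk-free count clamped by the quoted nr_pcp_free() snippet, and the proposed cap on pcp->free_count (in its min() form). This is a sketch of the quoted logic, not the kernel code itself; the default of 5 for CONFIG_PCP_BATCH_SCALE_MAX is the upstream Kconfig default.

```python
def nr_pcp_free(free_count, batch, high):
    # The quoted nr_pcp_free() clamp: free between batch and
    # high - batch pages, driven by the dynamic free_count.
    return max(batch, min(free_count, high - batch))

def bump_free_count(free_count, order, batch, scale_max=5):
    # The min() form of the proposed fix: freeing a high-order page
    # adds (1 << order) at once, but free_count must never overshoot
    # batch << CONFIG_PCP_BATCH_SCALE_MAX.
    return min(free_count + (1 << order), batch << scale_max)

# With the default high of 6392 (batch 63), a warmed-up free_count can
# drive a single bulk free of up to high - batch pages under zone->lock:
print(nr_pcp_free(6392, 63, 6392))   # -> 6329
# With the minimum high of 4 * batch = 252, at most 189 pages per bulk free:
print(nr_pcp_free(6392, 63, 252))    # -> 189
# The cap: adding an order-10 page just under the limit (63 << 5 == 2016)
# no longer overshoots it:
print(bump_free_count(2015, 10, 63)) # -> 2016
```

This is the mechanism behind the latency improvement: shrinking high bounds how many pages one free_pcppages_bulk() call moves while holding zone->lock, and the cap bounds how far free_count (and hence the next bulk-free batch) can grow.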