From: "Huang, Ying"
To: Yafang Shao
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Thu, 4 Jul 2024 21:27:59 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org> <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com> <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com> <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 05 Jul 2024 09:28:18 +0800
Message-ID: <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>> >> >> >> >>
>> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>> >> >> >> >>
>> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >> >> >>
>> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> >> >> reasonably detailed description of the issue and a description of any
>> >> >> >> >> ongoing work would be helpful here.
>> >> >> >> >
>> >> >> >> > In our containerized environment, we have a specific type of container
>> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> >> > processes are organized as separate processes rather than threads due
>> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> >> > containers hosted on the same machine experience significant latency
>> >> >> >> > spikes.
>> >> >> >> >
>> >> >> >> > Our investigation using perf tracing revealed that the root cause of
>> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> >> > primary contributor to the observed latency issues.
>> >> >> >> >
>> >> >> >> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >> >> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >> >> >      - 76.97% exit_mmap
>> >> >> >> >         - 58.58% unmap_vmas
>> >> >> >> >            - 58.55% unmap_single_vma
>> >> >> >> >               - unmap_page_range
>> >> >> >> >                  - 58.32% zap_pte_range
>> >> >> >> >                     - 42.88% tlb_flush_mmu
>> >> >> >> >                        - 42.76% free_pages_and_swap_cache
>> >> >> >> >                           - 41.22% release_pages
>> >> >> >> >                              - 33.29% free_unref_page_list
>> >> >> >> >                                 - 32.37% free_unref_page_commit
>> >> >> >> >                                    - 31.64% free_pcppages_bulk
>> >> >> >> >                                       + 28.65% _raw_spin_lock
>> >> >> >> >                                         1.28% __list_del_entry_valid
>> >> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >                                0.60% __mod_lruvec_state
>> >> >> >> >                             1.07% free_swap_cache
>> >> >> >> >                     + 11.69% page_remove_rmap
>> >> >> >> >                       0.64% __mod_lruvec_page_state
>> >> >> >> >         - 17.34% remove_vma
>> >> >> >> >            - 17.25% vm_area_free
>> >> >> >> >               - 17.23% kmem_cache_free
>> >> >> >> >                  - 17.15% __slab_free
>> >> >> >> >                     - 14.56% discard_slab
>> >> >> >> >                          free_slab
>> >> >> >> >                          __free_slab
>> >> >> >> >                          __free_pages
>> >> >> >> >                        - free_unref_page
>> >> >> >> >                           - 13.50% free_unref_page_commit
>> >> >> >> >                              - free_pcppages_bulk
>> >> >> >> >                                 + 13.44% _raw_spin_lock
>> >> >> >> >
>> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint we can find the detailed stack:
>> >> >> >> >
>> >> >> >> >           <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>> >> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >> >> >           <...>-1540432 [224] d..3. 618048.023887:
>> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >  => free_unref_page_commit
>> >> >> >> >  => free_unref_page_list
>> >> >> >> >  => release_pages
>> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >  => zap_pte_range
>> >> >> >> >  => unmap_page_range
>> >> >> >> >  => unmap_single_vma
>> >> >> >> >  => unmap_vmas
>> >> >> >> >  => exit_mmap
>> >> >> >> >  => mmput
>> >> >> >> >  => do_exit
>> >> >> >> >  => do_group_exit
>> >> >> >> >  => get_signal
>> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >  => do_syscall_64
>> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >
>> >> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> >> > within a single NUMA node.
>> >> >> >> > The zoneinfo is as follows:
>> >> >> >> >
>> >> >> >> > Node 0, zone   Normal
>> >> >> >> >   pages free     144465775
>> >> >> >> >         boost    0
>> >> >> >> >         min      1309270
>> >> >> >> >         low      1636587
>> >> >> >> >         high     1963904
>> >> >> >> >         spanned  564133888
>> >> >> >> >         present  296747008
>> >> >> >> >         managed  291974346
>> >> >> >> >         cma      0
>> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> >   ...
>> >> >> >> >   ...
>> >> >> >> >   pagesets
>> >> >> >> >     cpu: 0
>> >> >> >> >               count: 2217
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 1
>> >> >> >> >               count: 4510
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >   vm stats threshold: 125
>> >> >> >> >     cpu: 2
>> >> >> >> >               count: 3059
>> >> >> >> >               high:  6392
>> >> >> >> >               batch: 63
>> >> >> >> >
>> >> >> >> > ...
>> >> >> >> >
>> >> >> >> > The high is around 100 times the batch size.
>> >> >> >> >
>> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
>> >> >> >> > function during the container exit process:
>> >> >> >> >
>> >> >> >> > 19:48:54
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >> >
>> >> >> >> > The latency can reach tens of milliseconds.
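
As a rough illustration of where these numbers come from (a sketch only: it
assumes the default pcp high is derived from the zone's low watermark spread
across the online CPUs, and uses the 4x-batch minimum discussed just below;
the exact in-kernel derivation has more factors and varies by version):

#include <stdio.h>

int main(void)
{
	/* Figures taken from the zoneinfo above. */
	unsigned long low_wmark = 1636587;	/* zone low watermark, in pages */
	unsigned long ncpus     = 256;		/* online CPUs in the node */
	unsigned long batch     = 63;		/* pcp->batch */

	/* Default: pcp high is roughly the low watermark split across CPUs. */
	unsigned long high_default = low_wmark / ncpus;		/* ~6392 */

	/* Proposed minimum: clamp pcp high down to 4 * batch. */
	unsigned long high_min = 4 * batch;			/* 252 */

	printf("default pcp high ~= %lu pages (about %lux batch)\n",
	       high_default, high_default / batch);
	printf("minimum pcp high  = %lu pages\n", high_min);
	return 0;
}

With 256 CPUs the default high works out to roughly 6392 pages per CPU, about
100 times the batch of 63, matching the pagesets dump above; clamping it to
4 x batch would cap each per-CPU list at 252 pages.
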
>> >> >> >> >
>> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> >> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> >> >> > significantly reduce the latency associated with the
>> >> >> >> > free_pcppages_bulk() function during container exits:
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >> >
>> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> >> >> > knob to set the minimum pagelist high at a level that effectively
>> >> >> >> > mitigated latency issues, we observed that other containers were no
>> >> >> >> > longer experiencing similar complaints. As a result, we decided to
>> >> >> >> > implement this tuning as a permanent workaround and have deployed it
>> >> >> >> > across all clusters of servers where these containers may be deployed.
>> >> >> >>
>> >> >> >> Thanks for your detailed data.
>> >> >> >>
>> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> >> shouldn't be a problem?
>> >> >> >
>> >> >> > Right. The problem arises when the process holds the lock for too
>> >> >> > long, causing other processes that are attempting to allocate memory
>> >> >> > to experience delays or wait times.
>> >> >> >
>> >> >> >> Because users care more about the total time of
>> >> >> >> process exiting, that is, throughput. And I suspect that the zone->lock
>> >> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> >> your configuration?
>> >> >> >
>> >> >> > While reducing throughput may not be a significant concern due to the
>> >> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> >> > for assessing system stability, is of greater concern to users. Higher
>> >> >> > latency can lead to request errors, impacting the user experience.
>> >> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> >> > throughput, is preferable to experiencing higher throughput with
>> >> >> > unstable performance.
>> >> >> >
>> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> >> processes is a problem. And your configuration can help it.
>> >> >> >>
>> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. In that way,
>> >> >> >> you have a normal PCP size (high) but smaller PCP batch. I guess that
>> >> >> >> may help both latency and throughput in your system. Could you give it
>> >> >> >> a try?
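
For illustration, a simplified model of that idea (a hypothetical helper, not
the upstream implementation): the number of pages drained to the buddy
allocator under one zone->lock hold is the batch scaled by a free factor, and
capping that factor bounds the worst-case hold time even when pcp->high stays
large:

#include <stdio.h>

/* Stand-in for CONFIG_PCP_BATCH_SCALE_MAX. */
#define PCP_BATCH_SCALE_MAX	5

/*
 * Pages to drain from a per-CPU list in one zone->lock hold: consecutive
 * frees scale the batch up, but never beyond batch << PCP_BATCH_SCALE_MAX.
 */
static unsigned long pcp_drain_batch(unsigned long batch,
				     unsigned int free_factor,
				     unsigned long count)
{
	unsigned int scale = free_factor > PCP_BATCH_SCALE_MAX ?
			     PCP_BATCH_SCALE_MAX : free_factor;
	unsigned long nr = batch << scale;

	return nr < count ? nr : count;
}

int main(void)
{
	/* batch = 63 as in the zoneinfo above, 6392 pages queued on the list. */
	printf("free_factor 0: %lu pages per lock hold\n", pcp_drain_batch(63, 0, 6392));
	printf("free_factor 5: %lu pages per lock hold\n", pcp_drain_batch(63, 5, 6392));
	return 0;
}

Lowering the cap (e.g. the scale-max of 0 tried later in this thread) keeps
every hold at a single batch of 63 pages, while a cap of 5 allows up to 2016
pages per hold.
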
>> >> >> >
>> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> >> > configuration option. However, I've observed your recent improvements
>> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> >> > has prompted me to experiment with manually setting the
>> >> >> > pcp->free_factor to zero. While this adjustment provided some
>> >> >> > improvement, the results were not as significant as I had hoped.
>> >> >> >
>> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> >> > to more easily adjust it.
>> >> >>
>> >> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> >> upstream. Could you find a way to do that?
>> >> >
>> >> > I'm afraid I can't run an upstream kernel in our production environment :(
>> >> > Lots of code changes have to be made.
>> >>
>> >> Understand. Can you find a way to test upstream behavior, not upstream
>> >> kernel exactly? Or test the upstream kernel but in a similar but not
>> >> exactly production environment.
>> >
>> > I'm willing to give it a try, but it may take some time to achieve the
>> > desired results.
>>
>> Thanks!
>
> After I backported the series "mm: PCP high auto-tuning," which
> consists of a total of 9 patches, to our 6.1.y stable kernel and
> deployed it to our production environment, I observed a significant
> reduction in latency. The results are as follows:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 11       |                                        |
>       4096 -> 8191       : 3        |                                        |
>       8192 -> 16383      : 1        |                                        |
>      16384 -> 32767      : 2        |                                        |
>      32768 -> 65535      : 7        |                                        |
>      65536 -> 131071     : 198      |*********                               |
>     131072 -> 262143     : 530      |************************                |
>     262144 -> 524287     : 824      |**************************************  |
>     524288 -> 1048575    : 852      |****************************************|
>    1048576 -> 2097151    : 714      |*********************************       |
>    2097152 -> 4194303    : 389      |******************                      |
>    4194304 -> 8388607    : 143      |******                                  |
>    8388608 -> 16777215   : 29       |*                                       |
>   16777216 -> 33554431   : 1        |                                        |
>
> avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
>
> Compared to the previous data, the maximum latency has been reduced to
> less than 30ms.

That series can reduce the allocation/freeing from/to the buddy system,
thus reducing the lock contention.

> Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
> to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
> vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
> latency was further reduced to less than 2ms.
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 36       |                                        |
>       2048 -> 4095       : 5063     |*****                                   |
>       4096 -> 8191       : 31226    |********************************        |
>       8192 -> 16383      : 37606    |*************************************** |
>      16384 -> 32767      : 38359    |****************************************|
>      32768 -> 65535      : 30652    |*******************************         |
>      65536 -> 131071     : 18714    |*******************                     |
>     131072 -> 262143     : 7968     |********                                |
>     262144 -> 524287     : 1996     |**                                      |
>     524288 -> 1048575    : 302      |                                        |
>    1048576 -> 2097151    : 19       |                                        |
>
> avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
>
> After multiple trials, I observed no significant differences between
> each attempt.

The test results look good.

> Therefore, we decided to backport your improvements to our local
> kernel. Additionally, I propose introducing a new sysctl knob,
> vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
> to easily tune the setting based on their specific workloads.
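
A rough sketch of what such a knob could look like on a 6.1-era tree (purely
illustrative; the variable names, bounds, and placement are assumptions, not
the actual patch):

#include <linux/init.h>
#include <linux/sysctl.h>

/* Replaces the compile-time CONFIG_PCP_BATCH_SCALE_MAX with a runtime knob. */
static int pcp_batch_scale_max = 5;	/* current built-in default */
static int pcp_batch_scale_limit = 6;	/* illustrative upper bound */

static struct ctl_table pcp_batch_scale_table[] = {
	{
		.procname	= "pcp_batch_scale_max",
		.data		= &pcp_batch_scale_max,
		.maxlen		= sizeof(pcp_batch_scale_max),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
		.extra2		= &pcp_batch_scale_limit,
	},
	{ }	/* sentinel, still required on 6.1 */
};

static int __init pcp_batch_scale_sysctl_init(void)
{
	register_sysctl("vm", pcp_batch_scale_table);
	return 0;
}
late_initcall(pcp_batch_scale_sysctl_init);

The freeing paths would then read pcp_batch_scale_max where they currently use
the config constant, so /proc/sys/vm/pcp_batch_scale_max could be lowered on
latency-sensitive hosts without rebuilding the kernel.
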
The downside is that the pcp->high decaying (in decay_pcp_high()) will be
slower. That is, it will take longer for idle pages to be freed from PCP to
buddy. One possible solution is to keep the number of pages to decay, but free
them in batches with a loop like the following to control latency:

	/* Free decay_number pages in total, at most batch pages per lock hold. */
	count = decay_number;
	while (count > 0) {
		spin_lock();
		free_pcppages_bulk(, batch, );
		spin_unlock();
		count -= batch;
		if (count > 0)
			cond_resched();
	}

--
Best Regards,
Huang, Ying