From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
In-Reply-To: (Yafang Shao's message of "Thu, 11 Jul 2024 15:21:43 +0800")
References: <20240707094956.94654-1-laoar.shao@gmail.com> <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com> <87sewga0wx.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 11 Jul 2024 16:36:27 +0800
Message-ID: <87bk349vg4.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > Background
>> >> > ==========
>> >> >
>> >> > In our containerized environment, we have a specific type of container
>> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> > processes are organized as separate processes rather than threads due
>> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> > multi-threaded setup.
>> >> > Upon the exit of these containers, other containers hosted on the
>> >> > same machine experience significant latency spikes.
>> >> >
>> >> > Investigation
>> >> > =============
>> >> >
>> >> > My investigation using perf tracing revealed that the root cause of
>> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> > the exiting processes. This concurrent access to the zone->lock
>> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> > performance. The perf results clearly indicate this contention as a
>> >> > primary contributor to the observed latency issues.
>> >> >
>> >> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >      - 76.97% exit_mmap
>> >> >         - 58.58% unmap_vmas
>> >> >            - 58.55% unmap_single_vma
>> >> >               - unmap_page_range
>> >> >                  - 58.32% zap_pte_range
>> >> >                     - 42.88% tlb_flush_mmu
>> >> >                        - 42.76% free_pages_and_swap_cache
>> >> >                           - 41.22% release_pages
>> >> >                              - 33.29% free_unref_page_list
>> >> >                                 - 32.37% free_unref_page_commit
>> >> >                                    - 31.64% free_pcppages_bulk
>> >> >                                       + 28.65% _raw_spin_lock
>> >> >                                         1.28% __list_del_entry_valid
>> >> >                              + 3.25% folio_lruvec_lock_irqsave
>> >> >                              + 0.75% __mem_cgroup_uncharge_list
>> >> >                                0.60% __mod_lruvec_state
>> >> >                             1.07% free_swap_cache
>> >> >                     + 11.69% page_remove_rmap
>> >> >                       0.64% __mod_lruvec_page_state
>> >> >         - 17.34% remove_vma
>> >> >            - 17.25% vm_area_free
>> >> >               - 17.23% kmem_cache_free
>> >> >                  - 17.15% __slab_free
>> >> >                     - 14.56% discard_slab
>> >> >                          free_slab
>> >> >                          __free_slab
>> >> >                          __free_pages
>> >> >                        - free_unref_page
>> >> >                           - 13.50% free_unref_page_commit
>> >> >                              - free_pcppages_bulk
>> >> >                                 + 13.44% _raw_spin_lock
>> >>
>> >> I don't think your change will reduce zone->lock contention cycles.  So,
>> >> I don't find the value of the above data.
>> >>
>> >> > By enabling the mm_page_pcpu_drain tracepoint we can locate the
>> >> > pertinent pages, with the majority of them being regular order-0 user
>> >> > pages.
>> >> >
>> >> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >   <...>-1540432 [224] d..3. 618048.023887:
>> >> >    => free_pcppages_bulk
>> >> >    => free_unref_page_commit
>> >> >    => free_unref_page_list
>> >> >    => release_pages
>> >> >    => free_pages_and_swap_cache
>> >> >    => tlb_flush_mmu
>> >> >    => zap_pte_range
>> >> >    => unmap_page_range
>> >> >    => unmap_single_vma
>> >> >    => unmap_vmas
>> >> >    => exit_mmap
>> >> >    => mmput
>> >> >    => do_exit
>> >> >    => do_group_exit
>> >> >    => get_signal
>> >> >    => arch_do_signal_or_restart
>> >> >    => exit_to_user_mode_prepare
>> >> >    => syscall_exit_to_user_mode
>> >> >    => do_syscall_64
>> >> >    => entry_SYSCALL_64_after_hwframe
>> >> >
>> >> > The servers experiencing these issues are equipped with impressive
>> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> > within a single NUMA node. The zoneinfo is as follows:
>> >> >
>> >> >   Node 0, zone   Normal
>> >> >     pages free     144465775
>> >> >           boost    0
>> >> >           min      1309270
>> >> >           low      1636587
>> >> >           high     1963904
>> >> >           spanned  564133888
>> >> >           present  296747008
>> >> >           managed  291974346
>> >> >           cma      0
>> >> >           protection: (0, 0, 0, 0)
>> >> > ...
>> >> >     pagesets
>> >> >       cpu: 0
>> >> >         count: 2217
>> >> >         high:  6392
>> >> >         batch: 63
>> >> >       vm stats threshold: 125
>> >> >       cpu: 1
>> >> >         count: 4510
>> >> >         high:  6392
>> >> >         batch: 63
>> >> >       vm stats threshold: 125
>> >> >       cpu: 2
>> >> >         count: 3059
>> >> >         high:  6392
>> >> >         batch: 63
>> >> >
>> >> > ...
>> >> >
>> >> > The pcp high is around 100 times the batch size.
>> >> >
>> >> > I also traced the latency associated with the free_pcppages_bulk()
>> >> > function during the container exit process:
>> >> >
>> >> >      nsecs               : count
>> >> >          0 -> 1          : 0
>> >> >          2 -> 3          : 0
>> >> >          4 -> 7          : 0
>> >> >          8 -> 15         : 0
>> >> >         16 -> 31         : 0
>> >> >         32 -> 63         : 0
>> >> >         64 -> 127        : 0
>> >> >        128 -> 255        : 0
>> >> >        256 -> 511        : 148
>> >> >        512 -> 1023       : 334
>> >> >       1024 -> 2047       : 33
>> >> >       2048 -> 4095       : 5
>> >> >       4096 -> 8191       : 7
>> >> >       8192 -> 16383      : 12
>> >> >      16384 -> 32767      : 30
>> >> >      32768 -> 65535      : 21
>> >> >      65536 -> 131071     : 15
>> >> >     131072 -> 262143     : 27
>> >> >     262144 -> 524287     : 84
>> >> >     524288 -> 1048575    : 203
>> >> >    1048576 -> 2097151    : 284
>> >> >    2097152 -> 4194303    : 327
>> >> >    4194304 -> 8388607    : 215
>> >> >    8388608 -> 16777215   : 116
>> >> >   16777216 -> 33554431   : 47
>> >> >   33554432 -> 67108863   : 8
>> >> >   67108864 -> 134217727  : 3
>> >> >
>> >> > The latency can reach tens of milliseconds.
>> >> >
>> >> > Experimenting
>> >> > =============
>> >> >
>> >> > vm.percpu_pagelist_high_fraction
>> >> > --------------------------------
>> >> >
>> >> > The kernel version currently deployed in our production environment is
>> >> > the stable 6.1.y, and my initial strategy involves optimizing the
>> >>
>> >> IMHO, we should focus on upstream activity in the cover letter and patch
>> >> description.  And I don't think that it's necessary to describe the
>> >> alternative solution in too much detail.
>> >>
>> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
>> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
>> >> > page draining, which subsequently leads to a substantial reduction in
>> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a
>> >> > notable improvement in latency.
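For reference, a per-call latency histogram in this format can be collected
with a short BCC script.  The sketch below is illustrative only: it assumes
the BCC Python bindings and kprobes are available and that
free_pcppages_bulk() is not inlined on the running kernel, and it is not
necessarily the exact tooling behind the numbers quoted in this thread.  The
stock BCC funclatency tool produces equivalent output.

#!/usr/bin/env python3
# Sketch: log2 histogram of free_pcppages_bulk() call latency in nanoseconds.
from bcc import BPF
import time

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);
BPF_HISTOGRAM(dist);

int trace_entry(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();   /* lower 32 bits: thread id */
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    return 0;
}

int trace_return(struct pt_regs *ctx)
{
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp == 0)
        return 0;                           /* missed the entry probe */
    dist.increment(bpf_log2l(bpf_ktime_get_ns() - *tsp));
    start.delete(&tid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="free_pcppages_bulk", fn_name="trace_entry")
b.attach_kretprobe(event="free_pcppages_bulk", fn_name="trace_return")

print("Tracing free_pcppages_bulk()... hit Ctrl-C to print the histogram.")
try:
    time.sleep(3600)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("nsecs")

The distribution measured after raising the sysctl is quoted below.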
>> >> >
>> >> >      nsecs               : count
>> >> >          0 -> 1          : 0
>> >> >          2 -> 3          : 0
>> >> >          4 -> 7          : 0
>> >> >          8 -> 15         : 0
>> >> >         16 -> 31         : 0
>> >> >         32 -> 63         : 0
>> >> >         64 -> 127        : 0
>> >> >        128 -> 255        : 120
>> >> >        256 -> 511        : 365
>> >> >        512 -> 1023       : 201
>> >> >       1024 -> 2047       : 103
>> >> >       2048 -> 4095       : 84
>> >> >       4096 -> 8191       : 87
>> >> >       8192 -> 16383      : 4777
>> >> >      16384 -> 32767      : 10572
>> >> >      32768 -> 65535      : 13544
>> >> >      65536 -> 131071     : 12723
>> >> >     131072 -> 262143     : 8604
>> >> >     262144 -> 524287     : 3659
>> >> >     524288 -> 1048575    : 921
>> >> >    1048576 -> 2097151    : 122
>> >> >    2097152 -> 4194303    : 5
>> >> >
>> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
>> >> > pcp high watermark size to a minimum of four times the batch size. While
>> >> > this could theoretically affect throughput, as highlighted by Ying[0], we
>> >> > have yet to observe any significant difference in throughput within our
>> >> > production environment after implementing this change.
>> >> >
>> >> > Backporting the series "mm: PCP high auto-tuning"
>> >> > -------------------------------------------------
>> >>
>> >> Again, not upstream activity.  We can describe the upstream behavior
>> >> directly.
>> >
>> > Andrew has requested that I provide a more comprehensive analysis of
>> > this issue, and in response, I have endeavored to outline all the
>> > pertinent details in a thorough and detailed manner.
>>
>> IMHO, upstream activity can provide a comprehensive analysis of the issue
>> too.  And, your patch has changed a lot since the first version.  It's
>> better to describe your current version.
>
> After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> code is almost the same as the upstream kernel with respect to the pcp.
> I have thoroughly documented the detailed data showcasing the changes in
> the backported version, providing a clear picture of the results.
> However, it's crucial to note that I am unable to directly run the
> upstream kernel in our production environment due to practical
> constraints.

IMHO, the patch is for the upstream kernel, not some downstream kernel,
so the focus should be the upstream activity: the issue in the upstream
kernel, and how to resolve it.  The production environment test results
can be used to support the upstream change.

>> >> > My second endeavor was to backport the series titled
>> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
>> >> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
>> >> > production environment, I noted a pronounced reduction in latency.
>> >> > The observed outcomes are as enumerated below:
>> >> >
>> >> >      nsecs               : count
>> >> >          0 -> 1          : 0
>> >> >          2 -> 3          : 0
>> >> >          4 -> 7          : 0
>> >> >          8 -> 15         : 0
>> >> >         16 -> 31         : 0
>> >> >         32 -> 63         : 0
>> >> >         64 -> 127        : 0
>> >> >        128 -> 255        : 0
>> >> >        256 -> 511        : 0
>> >> >        512 -> 1023       : 0
>> >> >       1024 -> 2047       : 2
>> >> >       2048 -> 4095       : 11
>> >> >       4096 -> 8191       : 3
>> >> >       8192 -> 16383      : 1
>> >> >      16384 -> 32767      : 2
>> >> >      32768 -> 65535      : 7
>> >> >      65536 -> 131071     : 198
>> >> >     131072 -> 262143     : 530
>> >> >     262144 -> 524287     : 824
>> >> >     524288 -> 1048575    : 852
>> >> >    1048576 -> 2097151    : 714
>> >> >    2097152 -> 4194303    : 389
>> >> >    4194304 -> 8388607    : 143
>> >> >    8388608 -> 16777215   : 29
>> >> >   16777216 -> 33554431   : 1
>> >> >
>> >> > Compared to the previous data, the maximum latency has been reduced to
>> >> > less than 30ms.
>> >>
>> >> People don't care too much about page freeing latency during process
>> >> exit.  Instead, they care more about the process exit time, that is,
>> >> throughput.  So, it's better to show the page allocation latency
>> >> which is affected by the simultaneous process exits.
>> >
>> > I'm also confused.  Is this issue really hard to understand?
>>
>> IMHO, it's better to prove the issue directly.  If you cannot prove it
>> directly, you can try an alternative and describe why.
>
> Not all data can be verified straightforwardly or effortlessly. The
> primary focus lies in the zone->lock contention, which necessitates
> measuring the latency it incurs. To accomplish this, the
> free_pcppages_bulk() function serves as an effective tool for
> evaluation. Therefore, I have opted to specifically measure the
> latency associated with free_pcppages_bulk().
>
> The reason for not measuring allocation latency is that it would require
> finding a willing participant to endure the potential delays, and that
> proved unsuccessful as no one expressed interest. In contrast, assessing
> the latency of free_pcppages_bulk() only requires identifying and
> experimenting with the source of the delays, making it a more feasible
> approach.

Can you run a benchmark program that does a fair amount of memory
allocation yourself to test it?

--
Best Regards,
Huang, Ying
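A minimal sketch of that kind of benchmark, for what it's worth: it
repeatedly maps, touches, and unmaps anonymous memory and reports tail
latency, on the assumption that running it on the same machine while the
containers exit is representative.  The chunk size, iteration count, and
percentiles are illustrative choices, not values from this thread.

#!/usr/bin/env python3
# Sketch: measure anonymous memory allocation latency per 64 MiB chunk.
import mmap
import time

CHUNK = 64 << 20                 # 64 MiB of anonymous memory per iteration
PAGE = mmap.PAGESIZE
ITERATIONS = 1000

samples = []
for _ in range(ITERATIONS):
    t0 = time.perf_counter_ns()
    buf = mmap.mmap(-1, CHUNK,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    for off in range(0, CHUNK, PAGE):    # touch every page to allocate it
        buf[off] = 1
    samples.append(time.perf_counter_ns() - t0)
    buf.close()                          # unmap, freeing the pages again

samples.sort()
print("p50=%dns p99=%dns max=%dns" % (
    samples[len(samples) // 2],
    samples[int(len(samples) * 0.99)],
    samples[-1]))

Touching every page sends each iteration through the page fault and page
allocation path, so zone->lock contention caused by the concurrently exiting
processes should show up directly in the p99 and max numbers.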