From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Yafang Shao
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Fri, 5 Jul 2024 11:03:37 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com>
 <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
 <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 05 Jul 2024 13:31:06 +0800
Message-ID: <87cynse76t.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment when a specific container with multiple Python-based tasks exits. These tasks may hold the zone->lock for an extended period, significantly impacting latency for other containers attempting to allocate memory.
>> >> >> >> >> >>
>> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A reasonably detailed description of the issue and a description of any ongoing work would be helpful here.
>> >> >> >> >> >
>> >> >> >> >> > In our containerized environment, we have a specific type of container that runs 18 processes, each consuming approximately 6GB of RSS. These processes are organized as separate processes rather than threads, because the Python Global Interpreter Lock (GIL) is a bottleneck in a multi-threaded setup. Upon the exit of these containers, other containers hosted on the same machine experience significant latency spikes.
>> >> >> >> >> >
>> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of these spikes is the simultaneous execution of exit_mmap() by each of the exiting processes. This concurrent access to the zone->lock results in contention, which becomes a hotspot and negatively impacts performance. The perf results clearly indicate this contention as a primary contributor to the observed latency issues.
>> >> >> >> >> >
>> >> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >> >> >> >    - 76.97% exit_mmap
>> >> >> >> >> >       - 58.58% unmap_vmas
>> >> >> >> >> >          - 58.55% unmap_single_vma
>> >> >> >> >> >             - unmap_page_range
>> >> >> >> >> >                - 58.32% zap_pte_range
>> >> >> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >> >> >                         - 41.22% release_pages
>> >> >> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >> >> >                           1.07% free_swap_cache
>> >> >> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >> >> >       - 17.34% remove_vma
>> >> >> >> >> >          - 17.25% vm_area_free
>> >> >> >> >> >             - 17.23% kmem_cache_free
>> >> >> >> >> >                - 17.15% __slab_free
>> >> >> >> >> >                   - 14.56% discard_slab
>> >> >> >> >> >                        free_slab
>> >> >> >> >> >                        __free_slab
>> >> >> >> >> >                        __free_pages
>> >> >> >> >> >                      - free_unref_page
>> >> >> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >> >> >                            - free_pcppages_bulk
>> >> >> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >> >> >
>> >> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can find the detailed stack:
>> >> >> >> >> >
>> >> >> >> >> >           <...>-1540432 [224] d..3.  618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >> >> >> >           <...>-1540432 [224] d..3.  618048.023887:
>> >> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >> >  => free_unref_page_commit
>> >> >> >> >> >  => free_unref_page_list
>> >> >> >> >> >  => release_pages
>> >> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >> >  => zap_pte_range
>> >> >> >> >> >  => unmap_page_range
>> >> >> >> >> >  => unmap_single_vma
>> >> >> >> >> >  => unmap_vmas
>> >> >> >> >> >  => exit_mmap
>> >> >> >> >> >  => mmput
>> >> >> >> >> >  => do_exit
>> >> >> >> >> >  => do_group_exit
>> >> >> >> >> >  => get_signal
>> >> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >> >  => do_syscall_64
>> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >> >
>> >> >> >> >> > The servers experiencing these issues are equipped with impressive hardware specifications, including 256 CPUs and 1TB of memory, all within a single NUMA node. The zoneinfo is as follows:
>> >> >> >> >> >
>> >> >> >> >> > Node 0, zone   Normal
>> >> >> >> >> >   pages free     144465775
>> >> >> >> >> >         boost    0
>> >> >> >> >> >         min      1309270
>> >> >> >> >> >         low      1636587
>> >> >> >> >> >         high     1963904
>> >> >> >> >> >         spanned  564133888
>> >> >> >> >> >         present  296747008
>> >> >> >> >> >         managed  291974346
>> >> >> >> >> >         cma      0
>> >> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> >> > ...
>> >> >> >> >> > ...
>> >> >> >> >> >   pagesets
>> >> >> >> >> >     cpu: 0
>> >> >> >> >> >               count: 2217
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 1
>> >> >> >> >> >               count: 4510
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 2
>> >> >> >> >> >               count: 3059
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >
>> >> >> >> >> > ...
>> >> >> >> >> >
>> >> >> >> >> > The high (6392) is around 100 times the batch size (63).
>> >> >> >> >> >
>> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk() function during the container exit process:
>> >> >> >> >> >
>> >> >> >> >> > 19:48:54
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >> >    2097152 -> 4194303    : 327      |***************************************|
>> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >> >> >
>> >> >> >> >> > The latency can reach tens of milliseconds.
>> >> >> >> >> >
>> >> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the minimum pagelist high at 4 times the batch size, we were able to significantly reduce the latency associated with the free_pcppages_bulk() function during container exits:
>> >> >> >> >> >
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >> >> >
>> >> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl knob to set the minimum pagelist high at a level that effectively mitigated the latency issues, we observed that other containers were no longer experiencing similar complaints. As a result, we decided to implement this tuning as a permanent workaround and have deployed it across all clusters of servers where these containers may be deployed.
>> >> >> >> >>
>> >> >> >> >> Thanks for your detailed data.
>> >> >> >> >>
>> >> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting shouldn't be a problem?
>> >> >> >> >
>> >> >> >> > Right. The problem arises when the process holds the lock for too long, causing other processes that are attempting to allocate memory to experience delays or wait times.
>> >> >> >> >
>> >> >> >> >> Because users care more about the total time of process exiting, that is, throughput. And I suspect that the zone->lock contention and page allocating/freeing throughput will be worse with your configuration?
>> >> >> >> >
>> >> >> >> > While reducing throughput may not be a significant concern due to the minimal difference, the potential for latency spikes, a crucial metric for assessing system stability, is of greater concern to users. Higher latency can lead to request errors, impacting the user experience. Therefore, maintaining stability, even at the cost of slightly lower throughput, is preferable to experiencing higher throughput with unstable performance.
>> >> >> >> >
>> >> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other processes is a problem. And your configuration can help it.
>> >> >> >> >>
>> >> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. In that way, you have a normal PCP size (high) but a smaller PCP batch. I guess that may help both latency and throughput in your system. Could you give it a try?
>> >> >> >> >
>> >> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX configuration option. However, I've observed your recent improvements to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"), which has prompted me to experiment with manually setting pcp->free_factor to zero. While this adjustment provided some improvement, the results were not as significant as I had hoped.
>> >> >> >> >
>> >> >> >> > BTW, perhaps we should consider implementing a sysctl knob as an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users to adjust it more easily.
>> >> >> >>
>> >> >> >> If you cannot test upstream behavior, it's hard to make changes to upstream. Could you find a way to do that?
>> >> >> >
>> >> >> > I'm afraid I can't run an upstream kernel in our production environment :( Lots of code changes would have to be made.
>> >> >>
>> >> >> Understood. Can you find a way to test the upstream behavior, if not the upstream kernel exactly? Or test the upstream kernel in a similar, though not exactly production, environment?
>> >> >
>> >> > I'm willing to give it a try, but it may take some time to achieve the desired results.
>> >>
>> >> Thanks!
>> >
>> > After I backported the series "mm: PCP high auto-tuning," which consists of a total of 9 patches, to our 6.1.y stable kernel and deployed it to our production environment, I observed a significant reduction in latency. The results are as follows:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 2        |                                        |
>> >       2048 -> 4095       : 11       |                                        |
>> >       4096 -> 8191       : 3        |                                        |
>> >       8192 -> 16383      : 1        |                                        |
>> >      16384 -> 32767      : 2        |                                        |
>> >      32768 -> 65535      : 7        |                                        |
>> >      65536 -> 131071     : 198      |*********                               |
>> >     131072 -> 262143     : 530      |************************                |
>> >     262144 -> 524287     : 824      |**************************************  |
>> >     524288 -> 1048575    : 852      |****************************************|
>> >    1048576 -> 2097151    : 714      |*********************************       |
>> >    2097152 -> 4194303    : 389      |******************                      |
>> >    4194304 -> 8388607    : 143      |******                                  |
>> >    8388608 -> 16777215   : 29       |*                                       |
>> >   16777216 -> 33554431   : 1        |                                        |
>> >
>> > avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
>> >
>> > Compared to the previous data, the maximum latency has been reduced to less than 30ms.
>>
>> That series can reduce the allocation/freeing from/to the buddy system, thus reducing the lock contention.
>>
>> > Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max, to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum latency was further reduced to less than 2ms.
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 36       |                                        |
>> >       2048 -> 4095       : 5063     |*****                                   |
>> >       4096 -> 8191       : 31226    |********************************        |
>> >       8192 -> 16383      : 37606    |***************************************.|
>> >      16384 -> 32767      : 38359    |****************************************|
>> >      32768 -> 65535      : 30652    |*******************************         |
>> >      65536 -> 131071     : 18714    |*******************                     |
>> >     131072 -> 262143     : 7968     |********                                |
>> >     262144 -> 524287     : 1996     |**                                      |
>> >     524288 -> 1048575    : 302      |                                        |
>> >    1048576 -> 2097151    : 19       |                                        |
>> >
>> > avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
>> >
>> > After multiple trials, I observed no significant differences between each attempt.
>>
>> The test results look good.
>>
>> > Therefore, we decided to backport your improvements to our local kernel. Additionally, I propose introducing a new sysctl knob, vm.pcp_batch_scale_max, to the upstream kernel. This will enable users to easily tune the setting based on their specific workloads.
>>
>> The downside is that the pcp->high decaying (in decay_pcp_high()) will be slower. That is, it will take longer for idle pages to be freed from the PCP to the buddy system. One possible solution is to keep the decaying page number, but use a loop as follows to control latency:
>>
>>         while (count < decay_number) {
>>                 spin_lock();
>>                 free_pcppages_bulk(, batch, );
>>                 spin_unlock();
>>                 count -= batch;
>>                 if (count)
>>                         cond_resched();
>>         }
>
> I will try it with this additional change.
> Thanks for your suggestion.
>
> IIUC, the additional change should be as follows?
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>         int high_min, to_drain, batch;
> -       int todo = 0;
> +       int todo = 0, count = 0;
>
>         high_min = READ_ONCE(pcp->high_min);
>         batch = READ_ONCE(pcp->batch);
> @@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>          * control latency. This caps pcp->high decrement too.
>          */
>         if (pcp->high > high_min) {
> -               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
> +               pcp->high = max3(pcp->count - (batch << 5),

Please avoid using magic numbers if possible.  Otherwise, it looks good to me.  Thanks!

>                                  pcp->high - (pcp->high >> 3), high_min);
>                 if (pcp->high > high_min)
>                         todo++;
>         }
>
>         to_drain = pcp->count - pcp->high;
> -       if (to_drain > 0) {
> +       while (count < to_drain) {
>                 spin_lock(&pcp->lock);
> -               free_pcppages_bulk(zone, to_drain, pcp, 0);
> +               free_pcppages_bulk(zone, batch, pcp, 0);
>                 spin_unlock(&pcp->lock);
> +               count += batch;
>                 todo++;
> +               cond_resched();
>         }

--
Best Regards,
Huang, Ying
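
The latency-capping idea the thread converges on (drain the PCP backlog in pcp->batch-sized chunks, dropping the lock and rescheduling between chunks, instead of freeing everything in one long critical section) can be illustrated outside the kernel. Below is a minimal, self-contained userspace C sketch; the names (fake_zone, drain_in_batches, BATCH) and the pthread mutex standing in for zone->lock/pcp->lock are invented for the example and are not the kernel implementation.

/*
 * Userspace sketch (not kernel code) of the batched-drain idea:
 * free a large backlog in fixed-size batches, releasing the lock and
 * yielding between batches so concurrent allocators see a bounded
 * lock hold time.  All identifiers here are illustrative only.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define BATCH 63                        /* analogous to pcp->batch */

struct fake_zone {
        pthread_mutex_t lock;           /* stand-in for zone->lock */
        long freed;                     /* pages returned to the "buddy" */
};

/* Work done for one page while the lock is held. */
static void free_one_page(struct fake_zone *z)
{
        z->freed++;
}

/*
 * Drain 'count' pages in batches of BATCH.  Each batch is one short
 * critical section; between batches the lock is dropped and the CPU is
 * yielded, which mirrors the while (count < to_drain) loop in the diff
 * above (sched_yield() stands in for cond_resched()).
 */
static void drain_in_batches(struct fake_zone *z, long count)
{
        long done = 0;

        while (done < count) {
                long n = (count - done > BATCH) ? BATCH : count - done;

                pthread_mutex_lock(&z->lock);
                for (long i = 0; i < n; i++)
                        free_one_page(z);
                pthread_mutex_unlock(&z->lock);

                done += n;
                if (done < count)
                        sched_yield();
        }
}

int main(void)
{
        struct fake_zone z = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .freed = 0,
        };

        /* Drain a backlog of 6392 pages, the "high" value from the zoneinfo dump. */
        drain_in_batches(&z, 6392);
        printf("freed %ld pages in batches of %d\n", z.freed, BATCH);
        return 0;
}

The trade-off Ying points out still applies to this scheme: more, shorter critical sections mean the drain as a whole takes longer, so idle pages linger on the PCP a bit longer before reaching the buddy allocator.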