From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Tue, 2 Jul 2024 14:37:38 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
Date: Tue, 02 Jul 2024 17:08:42 +0800
Message-ID: <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Yafang Shao writes:

> On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>>
>> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>>
>> > Currently, we're encountering latency spikes in our container environment
>> > when a specific container with multiple Python-based tasks exits. These
>> > tasks may hold the zone->lock for an extended period, significantly
>> > impacting latency for other containers attempting to allocate memory.
>>
>> Is this locking issue well understood?  Is anyone working on it?  A
>> reasonably detailed description of the issue and a description of any
>> ongoing work would be helpful here.
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Our investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
>   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>      - 76.97% exit_mmap
>         - 58.58% unmap_vmas
>            - 58.55% unmap_single_vma
>               - unmap_page_range
>                  - 58.32% zap_pte_range
>                     - 42.88% tlb_flush_mmu
>                        - 42.76% free_pages_and_swap_cache
>                           - 41.22% release_pages
>                              - 33.29% free_unref_page_list
>                                 - 32.37% free_unref_page_commit
>                                    - 31.64% free_pcppages_bulk
>                                       + 28.65% _raw_spin_lock
>                                         1.28% __list_del_entry_valid
>                              + 3.25% folio_lruvec_lock_irqsave
>                              + 0.75% __mem_cgroup_uncharge_list
>                                0.60% __mod_lruvec_state
>                             1.07% free_swap_cache
>                     + 11.69% page_remove_rmap
>                       0.64% __mod_lruvec_page_state
>         - 17.34% remove_vma
>            - 17.25% vm_area_free
>               - 17.23% kmem_cache_free
>                  - 17.15% __slab_free
>                     - 14.56% discard_slab
>                          free_slab
>                          __free_slab
>                          __free_pages
>                        - free_unref_page
>                           - 13.50% free_unref_page_commit
>                              - free_pcppages_bulk
>                                 + 13.44% _raw_spin_lock
>
> By enabling the mm_page_pcpu_drain() tracepoint we can find the detailed
> stack:
>
> <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>   page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> <...>-1540432 [224] d..3. 618048.023887:
> => free_pcppages_bulk
> => free_unref_page_commit
> => free_unref_page_list
> => release_pages
> => free_pages_and_swap_cache
> => tlb_flush_mmu
> => zap_pte_range
> => unmap_page_range
> => unmap_single_vma
> => unmap_vmas
> => exit_mmap
> => mmput
> => do_exit
> => do_group_exit
> => get_signal
> => arch_do_signal_or_restart
> => exit_to_user_mode_prepare
> => syscall_exit_to_user_mode
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
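
As a side note for anyone following the stacks: what the profile and the
trace above show is each exiting task draining its full per-CPU page list
back to the buddy allocator while holding zone->lock.  The userspace mock
below is only meant to illustrate that shape of contention; the names, the
pthread mutex standing in for zone->lock, and the sizes (taken from the
pagesets data quoted further down) are assumptions for illustration, not
kernel code.

/*
 * Simplified userspace mock of the contention pattern above: several
 * "exiting tasks" each return their pages to one shared pool, and every
 * drain happens inside a single hold of a shared lock.  Illustrative
 * only; a pthread mutex stands in for zone->lock and an increment
 * stands in for freeing one page to the buddy allocator.
 */
#include <pthread.h>
#include <stdio.h>

#define NPROC      18           /* processes exiting together, as described */
#define PAGES_EACH (6L << 18)   /* ~6GB of 4KiB pages per process */
#define PCP_HIGH   6392         /* pages drained per lock hold in this mock */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static long zone_free_pages;

/* One bulk drain: all of 'count' pages are freed inside one lock hold. */
static void free_pages_bulk(long count)
{
        pthread_mutex_lock(&zone_lock);
        for (long i = 0; i < count; i++)
                zone_free_pages++;
        pthread_mutex_unlock(&zone_lock);
}

static void *exiting_task(void *arg)
{
        (void)arg;
        for (long done = 0; done < PAGES_EACH; done += PCP_HIGH) {
                long n = PAGES_EACH - done;

                free_pages_bulk(n > PCP_HIGH ? PCP_HIGH : n);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NPROC];

        for (int i = 0; i < NPROC; i++)
                pthread_create(&tid[i], NULL, exiting_task, NULL);
        for (int i = 0; i < NPROC; i++)
                pthread_join(tid[i], NULL);
        printf("freed %ld pages total\n", zone_free_pages);
        return 0;
}

The longer each hold is (i.e. the more pages drained per flush), the longer
every other task allocating or freeing pages has to wait, which is the
latency the other containers observe.
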
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The high is around 100 times the batch size.
>
> We also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
> 19:48:54
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |***************************************  |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>
> The latency can reach tens of milliseconds.
>
> By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> minimum pagelist high at 4 times the batch size, we were able to
> significantly reduce the latency associated with the
> free_pcppages_bulk() function during container exits:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>
> After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> knob to set the minimum pagelist high at a level that effectively
> mitigated latency issues, we observed that other containers were no
> longer experiencing similar complaints. As a result, we decided to
> implement this tuning as a permanent workaround and have deployed it
> across all clusters of servers where these containers may be deployed.

Thanks for your detailed data.

IIUC, the latency of free_pcppages_bulk() during process exit shouldn't
be a problem by itself, because users care more about the total time of
process exit, that is, throughput.  And I suspect that zone->lock
contention and page allocation/freeing throughput will be worse with
your configuration.

But the latency of free_pcppages_bulk() and of page allocation in other
processes is a problem, and your configuration can help with that.

Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  That way, you
keep a normal PCP size (high) but a smaller PCP batch.  I guess that may
help both latency and throughput on your system.  Could you give it a
try?
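
To put numbers on the two knobs being discussed, here is a
back-of-the-envelope model built only from the zoneinfo and pagesets
quoted above.  The formulas follow my reading of zone_highsize() and of
the PCP batch scaling (CONFIG_PCP_BATCH_SCALE_MAX defaults to 5, if I
read the Kconfig correctly), so treat it as an illustration rather than
the kernel's exact logic.

/*
 * Rough model of the per-CPU pagelist sizes, using the figures quoted
 * above.  An approximation of my reading of zone_highsize() and the PCP
 * batch scaling; not the kernel source.
 */
#include <stdio.h>

int main(void)
{
        long managed   = 291974346;  /* managed pages, Node 0 zone Normal */
        long low_wmark = 1636587;    /* low watermark, Node 0 zone Normal */
        long cpus      = 256;        /* online CPUs in the node */
        long batch     = 63;         /* pcp batch from the pagesets above */
        long scale_max = 5;          /* assumed CONFIG_PCP_BATCH_SCALE_MAX default */

        /* Default: high is derived from the low watermark split across CPUs. */
        long high_default = low_wmark / cpus;

        /* The floor the proposed "-1" setting aims for: 4 * batch. */
        long high_min = 4 * batch;

        /* Fraction one would need today (rounded up) to reach that floor. */
        long fraction_needed = (managed + high_min * cpus - 1) / (high_min * cpus);

        printf("default high         : %ld pages (~%ldx batch)\n",
               high_default, high_default / batch);
        printf("minimum high (4*b)   : %ld pages, needs fraction >= ~%ld\n",
               high_min, fraction_needed);
        /* Alternative: keep high as-is, but cap the pages freed per
         * zone->lock hold by lowering the batch scale. */
        printf("pages freed per hold : <= %ld with scale %ld, <= %ld with scale 0\n",
               batch << scale_max, scale_max, batch);
        return 0;
}

Either knob reduces how much work is done per zone->lock hold; the
difference is whether you shrink the per-CPU cache itself (high) or only
the amount drained per flush (batch).
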
[snip]

--
Best Regards,
Huang, Ying