From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 2 Jul 2024 14:37:38 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: Andrew Morton
Cc: linux-mm@kvack.org, Matthew Wilcox, David Rientjes, "Huang, Ying", Mel Gorman
In-Reply-To: <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>
> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
>
> Is this locking issue well understood?  Is anyone working on it?  A
> reasonably detailed description of the issue and a description of any
> ongoing work would be helpful here.

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. The
workload is organized as separate processes rather than threads because
the Python Global Interpreter Lock (GIL) would be a bottleneck in a
multi-threaded setup.

When these containers exit, other containers hosted on the same machine
experience significant latency spikes. Perf tracing shows that the root
cause is the exiting processes all executing exit_mmap() at the same
time. The concurrent access to the zone->lock makes that lock a
contention hotspot and degrades performance for everything else on the
machine. The perf results below show this contention as the primary
contributor to the observed latency:

  +   77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
  -   76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
     - 76.97% exit_mmap
        - 58.58% unmap_vmas
           - 58.55% unmap_single_vma
              - unmap_page_range
                 - 58.32% zap_pte_range
                    - 42.88% tlb_flush_mmu
                       - 42.76% free_pages_and_swap_cache
                          - 41.22% release_pages
                             - 33.29% free_unref_page_list
                                - 32.37% free_unref_page_commit
                                   - 31.64% free_pcppages_bulk
                                      + 28.65% _raw_spin_lock
                                        1.28% __list_del_entry_valid
                             + 3.25% folio_lruvec_lock_irqsave
                             + 0.75% __mem_cgroup_uncharge_list
                               0.60% __mod_lruvec_state
                            1.07% free_swap_cache
                    + 11.69% page_remove_rmap
                      0.64% __mod_lruvec_page_state
        - 17.34% remove_vma
           - 17.25% vm_area_free
              - 17.23% kmem_cache_free
                 - 17.15% __slab_free
                    - 14.56% discard_slab
                         free_slab
                         __free_slab
                         __free_pages
                       - free_unref_page
                          - 13.50% free_unref_page_commit
                             - free_pcppages_bulk
                                + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint we can see the detailed
call stack:

  <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
      page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
  <...>-1540432 [224] d..3. 618048.023887:
   => free_pcppages_bulk
   => free_unref_page_commit
   => free_unref_page_list
   => release_pages
   => free_pages_and_swap_cache
   => tlb_flush_mmu
   => zap_pte_range
   => unmap_page_range
   => unmap_single_vma
   => unmap_vmas
   => exit_mmap
   => mmput
   => do_exit
   => do_group_exit
   => get_signal
   => arch_do_signal_or_restart
   => exit_to_user_mode_prepare
   => syscall_exit_to_user_mode
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues are large machines: 256 CPUs and
1TB of memory, all within a single NUMA node.
The zoneinfo is as follows:

  Node 0, zone   Normal
    pages free     144465775
          boost    0
          min      1309270
          low      1636587
          high     1963904
          spanned  564133888
          present  296747008
          managed  291974346
          cma      0
          protection: (0, 0, 0, 0)
    ...
    pagesets
      cpu: 0
                count: 2217
                high:  6392
                batch: 63
      vm stats threshold: 125
      cpu: 1
                count: 4510
                high:  6392
                batch: 63
      vm stats threshold: 125
      cpu: 2
                count: 3059
                high:  6392
                batch: 63
    ...

The pcp high is around 100 times the batch size (6392 / 63 is roughly 101).

We also traced the latency of the free_pcppages_bulk() function while one
of these containers was exiting:

  19:48:54
       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 148      |*****************                       |
         512 -> 1023       : 334      |****************************************|
        1024 -> 2047       : 33       |***                                     |
        2048 -> 4095       : 5        |                                        |
        4096 -> 8191       : 7        |                                        |
        8192 -> 16383      : 12       |*                                       |
       16384 -> 32767      : 30       |***                                     |
       32768 -> 65535      : 21       |**                                      |
       65536 -> 131071     : 15       |*                                       |
      131072 -> 262143     : 27       |***                                     |
      262144 -> 524287     : 84       |**********                              |
      524288 -> 1048575    : 203      |************************                |
     1048576 -> 2097151    : 284      |**********************************      |
     2097152 -> 4194303    : 327      |***************************************  |
     4194304 -> 8388607    : 215      |*************************               |
     8388608 -> 16777215   : 116      |*************                           |
    16777216 -> 33554431   : 47       |*****                                   |
    33554432 -> 67108863   : 8        |                                        |
    67108864 -> 134217727  : 3        |                                        |

  avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920

The latency can reach tens of milliseconds.

By adjusting vm.percpu_pagelist_high_fraction so that the pagelist high is
clamped to its minimum of 4 times the batch size, we were able to
significantly reduce the free_pcppages_bulk() latency during container
exits:

       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 120      |                                        |
         256 -> 511        : 365      |*                                       |
         512 -> 1023       : 201      |                                        |
        1024 -> 2047       : 103      |                                        |
        2048 -> 4095       : 84       |                                        |
        4096 -> 8191       : 87       |                                        |
        8192 -> 16383      : 4777     |**************                          |
       16384 -> 32767      : 10572    |*******************************         |
       32768 -> 65535      : 13544    |****************************************|
       65536 -> 131071     : 12723    |*************************************   |
      131072 -> 262143     : 8604     |*************************               |
      262144 -> 524287     : 3659     |**********                              |
      524288 -> 1048575    : 921      |**                                      |
     1048576 -> 2097151    : 122      |                                        |
     2097152 -> 4194303    : 5        |                                        |

  avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925

After tuning vm.percpu_pagelist_high_fraction to keep the pagelist high at
this minimum and confirming that it mitigated the latency issue, the other
containers stopped reporting similar complaints. We have therefore adopted
this tuning as a permanent workaround and deployed it across all server
clusters where these containers may be scheduled.
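For reference, the numbers above can be sanity-checked with the small
userspace sketch below. It is only a simplified approximation of the pcp
"high" calculation done by zone_highsize() in mm/page_alloc.c, under the
assumption that the default high is derived from the zone low watermark
split across the CPUs local to the zone; the constants come from the
zoneinfo quoted earlier and the fraction value 100000 is purely
hypothetical.

/*
 * Simplified userspace approximation of the pcp "high" calculation;
 * not the kernel implementation. Inputs are the zoneinfo numbers above.
 */
#include <stdio.h>

int main(void)
{
        unsigned long low_wmark = 1636587;   /* zone "low" watermark, in pages */
        unsigned long managed   = 291974346; /* managed pages in the zone */
        unsigned long cpus      = 256;       /* CPUs in the single NUMA node */
        unsigned long batch     = 63;        /* pcp batch size */
        unsigned long min_high  = 4 * batch; /* the 4*batch floor */

        /* Default (fraction == 0): high is based on the zone low watermark,
         * split across the CPUs local to the zone. */
        unsigned long high_default = low_wmark / cpus;

        /* With a sufficiently large percpu_pagelist_high_fraction (100000 is
         * only an example), the per-CPU share drops below 4*batch and is
         * clamped to it; the proposed '-1' setting would request this floor
         * directly. */
        unsigned long fraction  = 100000;
        unsigned long high_frac = managed / fraction / cpus;
        unsigned long high      = high_frac > min_high ? high_frac : min_high;

        printf("default high: %lu pages (~%lu x batch)\n",
               high_default, high_default / batch);
        printf("clamped high: %lu pages (floor = %lu)\n", high, min_high);
        return 0;
}

This prints a default high of 6392 pages (about 101 times the batch size),
matching the zoneinfo above, versus 252 pages once clamped to 4*batch.
Since the per-CPU lists are drained back to the buddy allocator under
zone->lock, bounding how many pages they can accumulate is consistent with
the latency drop seen in the histograms above.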
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
> > page lists. A user can specify a number like 100 to allocate 1/100th of
> > each zone between per-cpu lists.
> >
> > +The minimum number of pages that can be stored in per-CPU page lists is
> > +four times the batch value. By writing '-1' to this sysctl, you can set
> > +this minimum value.
>
> I suggest we also describe why an operator would want to set this, and
> the expected effects of that action.

will improve it.

> > The batch value of each per-cpu page list remains the same regardless of
> > the value of the high fraction so allocation latencies are unaffected.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..e7313f9d704b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
> >       int nr_split_cpus;
> >       unsigned long total_pages;
> >
> > +     /* Setting -1 to set the minimum pagelist size, four times the batch size */
>
> Some old-timers still use 80-column xterms ;)

will change it.

Regards
Yafang