From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 11 Jul 2024 20:40:41 +0800
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <8734og9on4.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com>
 <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sewga0wx.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bk349vg4.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <8734og9on4.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > Background
> >> >> >> > ==========
> >> >> >> >
> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. The
> >> >> >> > workload is organized as separate processes rather than threads because
> >> >> >> > the Python Global Interpreter Lock (GIL) is a bottleneck in a
> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> > spikes.
> >> >> >> >
> >> >> >> > Investigation
> >> >> >> > =============
> >> >> >> >
> >> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >
> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >             - unmap_page_range
> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >                         - 41.22% release_pages
> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >                        1.07% free_swap_cache
> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >       - 17.34% remove_vma
> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >                - 17.15% __slab_free
> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >                        free_slab
> >> >> >> >                        __free_slab
> >> >> >> >                        __free_pages
> >> >> >> >                      - free_unref_page
> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >>
> >> >> >> I don't think your change will reduce zone->lock contention cycles. So,
> >> >> >> I don't find the value of the above data.
> >> >> >>
> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint we can locate the pertinent
> >> >> >> > pages, with the majority of them being regular order-0 user pages.
> >> >> >> >
> >> >> >> >  <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> >> >          page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> >  <...>-1540432 [224] d..3. 618048.023887:
> >> >> >> >  => free_pcppages_bulk
> >> >> >> >  => free_unref_page_commit
> >> >> >> >  => free_unref_page_list
> >> >> >> >  => release_pages
> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >  => tlb_flush_mmu
> >> >> >> >  => zap_pte_range
> >> >> >> >  => unmap_page_range
> >> >> >> >  => unmap_single_vma
> >> >> >> >  => unmap_vmas
> >> >> >> >  => exit_mmap
> >> >> >> >  => mmput
> >> >> >> >  => do_exit
> >> >> >> >  => do_group_exit
> >> >> >> >  => get_signal
> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >  => do_syscall_64
> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >
> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> > within a single NUMA node. The zoneinfo is as follows:
> >> >> >> >
> >> >> >> > Node 0, zone   Normal
> >> >> >> >   pages free     144465775
> >> >> >> >         boost    0
> >> >> >> >         min      1309270
> >> >> >> >         low      1636587
> >> >> >> >         high     1963904
> >> >> >> >         spanned  564133888
> >> >> >> >         present  296747008
> >> >> >> >         managed  291974346
> >> >> >> >         cma      0
> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> > ...
> >> >> >> >   pagesets
> >> >> >> >     cpu: 0
> >> >> >> >               count: 2217
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 1
> >> >> >> >               count: 4510
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 2
> >> >> >> >               count: 3059
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >
> >> >> >> > ...
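[Aside: the "high: 6392" vs. "batch: 63" above falls out of how the per-CPU
high limit is derived. The toy calculation below is only a rough model of
what mm/page_alloc.c does (the helper name is made up, and the details
differ between kernel versions), but plugging in the numbers from this
zoneinfo reproduces both the default high and the 4 * batch floor that
raising vm.percpu_pagelist_high_fraction runs into, discussed further
below.]

/*
 * Simplified model of how the per-CPU pcp "high" limit is derived.
 * The helper name is invented and details differ across kernel
 * versions; the constants in main() are taken from the zoneinfo above.
 */
#include <stdio.h>

static unsigned long pcp_high_estimate(unsigned long managed_pages,
                                       unsigned long low_wmark,
                                       unsigned long nr_cpus,
                                       unsigned long high_fraction,
                                       unsigned long batch)
{
        /* default: spread the zone's low watermark across the CPUs */
        unsigned long total = high_fraction ? managed_pages / high_fraction
                                            : low_wmark;
        unsigned long high = total / nr_cpus;

        /* the high limit never drops below four times the batch size */
        return high > 4 * batch ? high : 4 * batch;
}

int main(void)
{
        unsigned long managed = 291974346, low = 1636587, cpus = 256, batch = 63;

        /* default: ~6392 pages, i.e. roughly 100x the batch size */
        printf("default high: %lu\n",
               pcp_high_estimate(managed, low, cpus, 0, batch));

        /* vm.percpu_pagelist_high_fraction = 0x7fffffff: clamped to 4 * batch */
        printf("high with fraction 0x7fffffff: %lu\n",
               pcp_high_estimate(managed, low, cpus, 0x7fffffff, batch));

        return 0;
}

In other words, by default the zone's low watermark is split across the
256 CPUs, which is what makes high roughly 100 times the batch size on
this machine.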
> >> >> >> >
> >> >> >> > The pcp high is around 100 times the batch size.
> >> >> >> >
> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> >> > function during the container exit process:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 0        |
> >> >> >> >        256 -> 511        : 148      |*****************
> >> >> >> >        512 -> 1023       : 334      |****************************************
> >> >> >> >       1024 -> 2047       : 33       |***
> >> >> >> >       2048 -> 4095       : 5        |
> >> >> >> >       4096 -> 8191       : 7        |
> >> >> >> >       8192 -> 16383      : 12       |*
> >> >> >> >      16384 -> 32767      : 30       |***
> >> >> >> >      32768 -> 65535      : 21       |**
> >> >> >> >      65536 -> 131071     : 15       |*
> >> >> >> >     131072 -> 262143     : 27       |***
> >> >> >> >     262144 -> 524287     : 84       |**********
> >> >> >> >     524288 -> 1048575    : 203      |************************
> >> >> >> >    1048576 -> 2097151    : 284      |**********************************
> >> >> >> >    2097152 -> 4194303    : 327      |***************************************
> >> >> >> >    4194304 -> 8388607    : 215      |*************************
> >> >> >> >    8388608 -> 16777215   : 116      |*************
> >> >> >> >   16777216 -> 33554431   : 47       |*****
> >> >> >> >   33554432 -> 67108863   : 8        |
> >> >> >> >   67108864 -> 134217727  : 3        |
> >> >> >> >
> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >
> >> >> >> > Experimenting
> >> >> >> > =============
> >> >> >> >
> >> >> >> > vm.percpu_pagelist_high_fraction
> >> >> >> > --------------------------------
> >> >> >> >
> >> >> >> > The kernel version currently deployed in our production environment is
> >> >> >> > the stable 6.1.y, and my initial strategy involves optimizing the
> >> >> >>
> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> >> description. And I don't think that it's necessary to describe the
> >> >> >> alternative solution in too much detail.
> >> >> >>
> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a
> >> >> >> > notable improvement in latency.
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 120      |
> >> >> >> >        256 -> 511        : 365      |*
> >> >> >> >        512 -> 1023       : 201      |
> >> >> >> >       1024 -> 2047       : 103      |
> >> >> >> >       2048 -> 4095       : 84       |
> >> >> >> >       4096 -> 8191       : 87       |
> >> >> >> >       8192 -> 16383      : 4777     |**************
> >> >> >> >      16384 -> 32767      : 10572    |*******************************
> >> >> >> >      32768 -> 65535      : 13544    |****************************************
> >> >> >> >      65536 -> 131071     : 12723    |*************************************
> >> >> >> >     131072 -> 262143     : 8604     |*************************
> >> >> >> >     262144 -> 524287     : 3659     |**********
> >> >> >> >     524288 -> 1048575    : 921      |**
> >> >> >> >    1048576 -> 2097151    : 122      |
> >> >> >> >    2097152 -> 4194303    : 5        |
> >> >> >> >
> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease
> >> >> >> > the pcp high watermark size to a minimum of four times the batch size.
> >> >> >> > While this could theoretically affect throughput, as highlighted by
> >> >> >> > Ying[0], we have yet to observe any significant difference in throughput
> >> >> >> > within our production environment after implementing this change.
> >> >> >> >
> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> >> > -------------------------------------------------
> >> >> >>
> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> >> directly.
> >> >> >
> >> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> >> > this issue, and in response, I have endeavored to outline all the
> >> >> > pertinent details in a thorough and detailed manner.
> >> >>
> >> >> IMHO, upstream activity can provide a comprehensive analysis of the issue
> >> >> too.  And your patch has changed a lot since the first version.  It's
> >> >> better to describe your current version.
> >> >
> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> >> > code is almost the same as the upstream kernel wrt the pcp. I have
> >> > thoroughly documented the detailed data showcasing the changes in the
> >> > backported version, providing a clear picture of the results. However,
> >> > it's crucial to note that I am unable to directly run the upstream
> >> > kernel in our production environment due to practical constraints.
> >>
> >> IMHO, the patch is for the upstream kernel, not some downstream kernel, so
> >> the focus should be the upstream activity: the issue of the upstream
> >> kernel, and how to resolve it.  The production environment test results
> >> can be used to support the upstream change.
> >
> > The only distinction in the pcp between version 6.1.y and the
> > upstream kernel lies in the modifications you made to the code.
> > Furthermore, given that your code changes have now been successfully
> > backported, what else do you expect me to do?
>
> If you can run the upstream kernel directly with some proxy workloads,
> it will be better.  But I understand that this may not be easy for you.
>
> So, what I really expect you to do is to organize the patch description
> in an upstream-centric way: describe the issue of the upstream kernel
> and how you resolve it,
> although your test data comes from a downstream kernel with the same
> page allocator behavior.
>
> >> >> >> > My second endeavor was to backport the series titled
> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> >> > into our 6.1.y stable kernel. Subsequent to its deployment in our
> >> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> >> > observed outcomes are enumerated below:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 0        |
> >> >> >> >        256 -> 511        : 0        |
> >> >> >> >        512 -> 1023       : 0        |
> >> >> >> >       1024 -> 2047       : 2        |
> >> >> >> >       2048 -> 4095       : 11       |
> >> >> >> >       4096 -> 8191       : 3        |
> >> >> >> >       8192 -> 16383      : 1        |
> >> >> >> >      16384 -> 32767      : 2        |
> >> >> >> >      32768 -> 65535      : 7        |
> >> >> >> >      65536 -> 131071     : 198      |*********
> >> >> >> >     131072 -> 262143     : 530      |************************
> >> >> >> >     262144 -> 524287     : 824      |**************************************
> >> >> >> >     524288 -> 1048575    : 852      |****************************************
> >> >> >> >    1048576 -> 2097151    : 714      |*********************************
> >> >> >> >    2097152 -> 4194303    : 389      |******************
> >> >> >> >    4194304 -> 8388607    : 143      |******
> >> >> >> >    8388608 -> 16777215   : 29       |*
> >> >> >> >   16777216 -> 33554431   : 1        |
> >> >> >> >
> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> >> > less than 30ms.
> >> >> >>
> >> >> >> People don't care too much about page freeing latency while processes are
> >> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> >> which is affected by the simultaneously exiting processes.
> >> >> >
> >> >> > I'm confused too. Is this issue really hard to understand?
> >> >>
> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> >> directly, you can try an alternative and describe why.
> >> >
> >> > Not all data can be verified straightforwardly or effortlessly. The
> >> > primary focus is the zone->lock contention, which necessitates
> >> > measuring the latency it incurs. To accomplish this, the
> >> > free_pcppages_bulk() function serves as an effective tool for
> >> > evaluation. Therefore, I have opted to specifically measure the
> >> > latency associated with free_pcppages_bulk().
> >> >
> >> > The reason for not measuring allocation latency is that it requires
> >> > finding a willing participant to endure the potential delays, a search
> >> > that proved unsuccessful as no one expressed interest. In contrast,
> >> > assessing the latency of free_pcppages_bulk() only requires identifying
> >> > and experimenting with the source causing the delays, making it a more
> >> > feasible approach.
> >>
> >> Can you run a benchmark program that does quite some memory allocation
> >> by yourself to test it?
> >
> > I can have a try.
>
> Thanks!
>
> > However, is it the key point here?
>
> It's better to prove the issue directly instead of indirectly.
>
> > Why can't the lock contention be measured by the freeing?
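[On the benchmark question above: a sketch along the following lines would
be a starting point. It simply faults in anonymous memory page by page and
reports the slowest fault per pass, which is where a wait on zone->lock
during the container exits would show up. The 1 GiB region size and the
endless loop are arbitrary choices for illustration, not an agreed-upon
test.]

/*
 * Minimal allocation-latency probe: repeatedly fault in anonymous
 * memory one page at a time and report the slowest page fault seen
 * in each pass. Run it while the containers are exiting; stop with
 * Ctrl-C. Illustrative sketch only, not a polished benchmark.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define REGION    (1UL << 30)    /* fault in 1 GiB per pass */

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
        for (;;) {
                char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                uint64_t worst = 0;

                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                for (unsigned long off = 0; off < REGION; off += PAGE_SZ) {
                        uint64_t t0 = now_ns();
                        uint64_t d;

                        buf[off] = 1;            /* trigger the page fault */
                        d = now_ns() - t0;
                        if (d > worst)
                                worst = d;
                }

                printf("worst page-fault latency this pass: %llu ns\n",
                       (unsigned long long)worst);
                munmap(buf, REGION);
        }
}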
>
> Have you measured the lock contention after adjusting
> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
> worse.  A smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
> hurt lock contention.  I have said it several times, but it seems that
> you don't agree with me.  Can you prove I'm wrong with data?

Now I understand the point. It seems we have different understandings of
the zone->lock contention.

  CPU A (Freer)                      CPU B (Allocator)
  lock zone->lock
  free pages                         lock zone->lock
  unlock zone->lock
                                     alloc pages
                                     unlock zone->lock

If the Freer holds the zone->lock for an extended period, the Allocator
has to wait, right? Isn't that a lock contention issue? Lock contention
affects not only CPU system usage but also latency.

-- 
Regards
Yafang
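To put rough numbers behind this trade-off, the toy program below
(userspace only; the mutex, the per-page work, and the chunk sizes are
stand-ins rather than kernel code) frees a fixed number of "pages" under
one lock in chunks of different sizes, analogous to capping each
zone->lock hold at batch << pcp_batch_scale_max. It reports the worst
single hold, which bounds how long a concurrent allocator could be kept
waiting, together with the number of lock acquisitions: a smaller chunk
shortens the worst hold but takes the lock more often, which is exactly
the tension between the two views above. Build with: cc -O2 -pthread.

/*
 * Userspace toy model of draining a pcp list: free 'total' pages while
 * holding a lock, either in one long critical section or in smaller
 * chunks. No allocator thread is running; the point is only to compare
 * the longest single lock hold against the number of acquisitions.
 * Names and numbers are illustrative, not taken from the kernel.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Stand-in for freeing one page: a little work done under the lock. */
static void free_one_page(volatile uint64_t *sink)
{
        for (int i = 0; i < 200; i++)
                *sink += i;
}

static void drain(int total, int chunk)
{
        volatile uint64_t sink = 0;
        uint64_t worst_hold = 0;
        int acquisitions = 0;

        for (int done = 0; done < total; ) {
                int n = (total - done < chunk) ? total - done : chunk;
                uint64_t t0, hold;

                pthread_mutex_lock(&zone_lock);
                t0 = now_ns();
                for (int i = 0; i < n; i++)
                        free_one_page(&sink);
                hold = now_ns() - t0;
                pthread_mutex_unlock(&zone_lock);

                if (hold > worst_hold)
                        worst_hold = hold;
                acquisitions++;
                done += n;
        }

        printf("chunk=%5d  acquisitions=%4d  worst hold=%9llu ns\n",
               chunk, acquisitions, (unsigned long long)worst_hold);
}

int main(void)
{
        int total = 6392;        /* roughly the pcp "high" from the zoneinfo above */

        drain(total, total);     /* drain everything in one hold */
        drain(total, 63 * 32);   /* ~ batch << 5, the current default scale */
        drain(total, 63);        /* batch only: short holds, many acquisitions */

        return 0;
}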