From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 11 Jul 2024 10:25:06 +0800
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com> <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > Background
> > ==========
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads due
> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> > multi-threaded setup. Upon the exit of these containers, other
> > containers hosted on the same machine experience significant latency
> > spikes.
> >
> > Investigation
> > =============
> >
> > My investigation using perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by each of
> > the exiting processes. This concurrent access to the zone->lock
> > results in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as a
> > primary contributor to the observed latency issues.
> >
> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >      - 76.97% exit_mmap
> >         - 58.58% unmap_vmas
> >            - 58.55% unmap_single_vma
> >               - unmap_page_range
> >                  - 58.32% zap_pte_range
> >                     - 42.88% tlb_flush_mmu
> >                        - 42.76% free_pages_and_swap_cache
> >                           - 41.22% release_pages
> >                              - 33.29% free_unref_page_list
> >                                 - 32.37% free_unref_page_commit
> >                                    - 31.64% free_pcppages_bulk
> >                                       + 28.65% _raw_spin_lock
> >                                         1.28% __list_del_entry_valid
> >                              + 3.25% folio_lruvec_lock_irqsave
> >                              + 0.75% __mem_cgroup_uncharge_list
> >                                0.60% __mod_lruvec_state
> >                             1.07% free_swap_cache
> >                     + 11.69% page_remove_rmap
> >                       0.64% __mod_lruvec_page_state
> >         - 17.34% remove_vma
> >            - 17.25% vm_area_free
> >               - 17.23% kmem_cache_free
> >                  - 17.15% __slab_free
> >                     - 14.56% discard_slab
> >                          free_slab
> >                          __free_slab
> >                          __free_pages
> >                        - free_unref_page
> >                           - 13.50% free_unref_page_commit
> >                              - free_pcppages_bulk
> >                                 + 13.44% _raw_spin_lock
>
> I don't think your change will reduce zone->lock contention cycles.  So,
> I don't find the value of the above data.
>
> > By enabling the mm_page_pcpu_drain() we can locate the pertinent page,
> > with the majority of them being regular order-0 user pages.
> >
> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >   <...>-1540432 [224] d..3. 618048.023887:
> >  => free_pcppages_bulk
> >  => free_unref_page_commit
> >  => free_unref_page_list
> >  => release_pages
> >  => free_pages_and_swap_cache
> >  => tlb_flush_mmu
> >  => zap_pte_range
> >  => unmap_page_range
> >  => unmap_single_vma
> >  => unmap_vmas
> >  => exit_mmap
> >  => mmput
> >  => do_exit
> >  => do_group_exit
> >  => get_signal
> >  => arch_do_signal_or_restart
> >  => exit_to_user_mode_prepare
> >  => syscall_exit_to_user_mode
> >  => do_syscall_64
> >  => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues are equipped with impressive
> > hardware specifications, including 256 CPUs and 1TB of memory, all
> > within a single NUMA node. The zoneinfo is as follows,
> >
> > Node 0, zone   Normal
> >   pages free     144465775
> >         boost    0
> >         min      1309270
> >         low      1636587
> >         high     1963904
> >         spanned  564133888
> >         present  296747008
> >         managed  291974346
> >         cma      0
> >         protection: (0, 0, 0, 0)
> > ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >   vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >
> > ...
> >
> > The pcp high is around 100 times the batch size.
> >
> > I also traced the latency associated with the free_pcppages_bulk()
> > function during the container exit process:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > The latency can reach tens of milliseconds.
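To put those pageset numbers in perspective, here is the arithmetic spelled
out as a tiny userspace C program. It only restates the batch and high values
quoted above; the actual drain sizing in mm/page_alloc.c is more involved.

	#include <stdio.h>

	int main(void)
	{
		/* Per-CPU pageset values from the zoneinfo above. */
		const int batch = 63;
		const int high = 6392;
		const int page_kb = 4;   /* order-0 page size on x86_64, in KB */

		printf("high / batch = %d\n", high / batch);
		printf("memory behind one full pcp list = %.1f MB\n",
		       high * page_kb / 1024.0);

		/*
		 * Worst case, a drain triggered at pcp->high returns most of
		 * those pages to the buddy allocator under a single zone->lock
		 * hold, and with 18 processes exiting at once the other CPUs
		 * spin on that same lock in free_pcppages_bulk().
		 */
		return 0;
	}

In other words, each CPU can accumulate roughly 25 MB of order-0 pages before
a drain kicks in, and that drain is what shows up as the multi-millisecond
free_pcppages_bulk() calls in the histogram above.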
> >
> > Experimenting
> > =============
> >
> > vm.percpu_pagelist_high_fraction
> > --------------------------------
> >
> > The kernel version currently deployed in our production environment is the
> > stable 6.1.y, and my initial strategy involves optimizing the
>
> IMHO, we should focus on upstream activity in the cover letter and patch
> description.  And I don't think that it's necessary to describe the
> alternative solution with too much details.
>
> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> > page draining, which subsequently leads to a substantial reduction in
> > latency. After setting the sysctl value to 0x7fffffff, I observed a notable
> > improvement in latency.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease the
> > pcp high watermark size to a minimum of four times the batch size. While
> > this could theoretically affect throughput, as highlighted by Ying[0], we
> > have yet to observe any significant difference in throughput within our
> > production environment after implementing this change.
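For context on why the knob has this effect, the mapping from the fraction to
the per-CPU high mark can be modeled in a few lines. This is a simplification
of the zone_highsize() logic in mm/page_alloc.c, using the zoneinfo figures
above; treat it as a model, not the exact kernel code.

	#include <stdio.h>

	/* Rough model: split a per-zone page budget across the local CPUs,
	 * with a floor of four times the batch size. */
	static long pcp_high(long base_pages, int local_cpus, int batch)
	{
		long high = base_pages / local_cpus;

		if (high < 4L * batch)
			high = 4L * batch;
		return high;
	}

	int main(void)
	{
		const long managed = 291974346;   /* managed pages, Node 0 Normal */
		const long low_wmark = 1636587;   /* low watermark, same zoneinfo */
		const int cpus = 256, batch = 63;

		/* Default (fraction == 0): budget comes from the low watermark. */
		printf("fraction=0 (default) -> high=%ld pages\n",
		       pcp_high(low_wmark, cpus, batch));
		/* Non-zero fraction: budget is managed_pages / fraction. */
		printf("fraction=0x7fffffff  -> high=%ld pages\n",
		       pcp_high(managed / 0x7fffffffL, cpus, batch));
		return 0;
	}

With the default fraction of 0 this model reproduces the observed high of 6392
(low watermark 1636587 split over 256 CPUs), while 0x7fffffff collapses it to
the 4 * batch floor of 252 pages, so each drain moves far fewer pages per
zone->lock hold.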
> >
> > Backporting the series "mm: PCP high auto-tuning"
> > -------------------------------------------------
>
> Again, not upstream activity.  We can describe the upstream behavior
> directly.

Andrew has requested a more comprehensive analysis of this issue, so I
have tried to lay out all the pertinent details here.

> > My second endeavor was to backport the series titled
> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> > into our 6.1.y stable kernel version. Subsequent to its deployment in our
> > production environment, I noted a pronounced reduction in latency. The
> > observed outcomes are as enumerated below:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> People don't care too much about page freeing latency during processes
> exiting.  Instead, they care more about the process exiting time, that
> is, throughput.  So, it's better to show the page allocation latency
> which is affected by the simultaneous processes exiting.

I'm also confused by this comment. Is this issue really hard to understand?

--
Regards
Yafang