References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org> <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com> <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com> <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Yafang Shao
Date: Fri, 5 Jul 2024 11:03:37 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: "Huang, Ying"
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman

On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
> >> >> >> >> >>
> >> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >> >> >>
> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> >> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> >> >> ongoing work would be helpful here.
> >> >> >> >> >
> >> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> >> > spikes.
> >> >> >> >> >
> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >> >
> >> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >> >             - unmap_page_range
> >> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >> >                         - 41.22% release_pages
> >> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >> >       - 17.34% remove_vma
> >> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >> >                - 17.15% __slab_free
> >> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >> >                        free_slab
> >> >> >> >> >                        __free_slab
> >> >> >> >> >                        __free_pages
> >> >> >> >> >                      - free_unref_page
> >> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >> >> >
> >> >> >> >> > By enabling the mm_page_pcpu_drain() tracepoint we can find the detailed stack:
> >> >> >> >> >
> >> >> >> >> >     <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> >> >     <...>-1540432 [224] d..3. 618048.023887:
> >> >> >> >> >  => free_pcppages_bulk
> >> >> >> >> >  => free_unref_page_commit
> >> >> >> >> >  => free_unref_page_list
> >> >> >> >> >  => release_pages
> >> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >> >  => tlb_flush_mmu
> >> >> >> >> >  => zap_pte_range
> >> >> >> >> >  => unmap_page_range
> >> >> >> >> >  => unmap_single_vma
> >> >> >> >> >  => unmap_vmas
> >> >> >> >> >  => exit_mmap
> >> >> >> >> >  => mmput
> >> >> >> >> >  => do_exit
> >> >> >> >> >  => do_group_exit
> >> >> >> >> >  => get_signal
> >> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >> >  => do_syscall_64
> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >> >
> >> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >> >> >
> >> >> >> >> > Node 0, zone   Normal
> >> >> >> >> >   pages free     144465775
> >> >> >> >> >         boost    0
> >> >> >> >> >         min      1309270
> >> >> >> >> >         low      1636587
> >> >> >> >> >         high     1963904
> >> >> >> >> >         spanned  564133888
> >> >> >> >> >         present  296747008
> >> >> >> >> >         managed  291974346
> >> >> >> >> >         cma      0
> >> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> >> > ...
> >> >> >> >> > ...
> >> >> >> >> >   pagesets
> >> >> >> >> >     cpu: 0
> >> >> >> >> >               count: 2217
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 1
> >> >> >> >> >               count: 4510
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 2
> >> >> >> >> >               count: 3059
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >
> >> >> >> >> > ...
> >> >> >> >> >
> >> >> >> >> > The high is around 100 times the batch size.
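
As background, that roughly 100x ratio between high and batch falls out of how the kernel sizes the per-CPU pagelists. Below is a simplified sketch of the sizing logic, loosely following zone_highsize() in mm/page_alloc.c; the helper name, signature and details are illustrative only and differ from the real code across kernel versions:

/*
 * Illustrative sketch only (not the kernel's function): roughly how
 * pcp->high is derived for a zone.  With the default
 * percpu_pagelist_high_fraction == 0 it is based on the zone's low
 * watermark; with a non-zero fraction it is based on managed pages.
 */
static int zone_highsize_sketch(unsigned long managed_pages,
                                unsigned long low_wmark_pages,
                                int batch, int nr_local_cpus,
                                int high_fraction)
{
        unsigned long total_pages;
        int high;

        if (!high_fraction)
                total_pages = low_wmark_pages;              /* default policy */
        else
                total_pages = managed_pages / high_fraction;

        /* Split the budget across the CPUs local to the zone's node. */
        high = total_pages / nr_local_cpus;

        /* Never drop below four batches, the historical minimum. */
        return high < (batch << 2) ? (batch << 2) : high;
}

Plugging in the numbers above (low watermark 1636587 pages, 256 local CPUs) gives roughly 6392 pages per CPU, i.e. about 100 times the batch of 63, matching the pagesets dump; pushing the sysctl to its minimum-pagelist setting instead clamps high to 4 * batch = 252 pages.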
> >> >> >> >> >
> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> >> >> > function during the container exit process:
> >> >> >> >> >
> >> >> >> >> > 19:48:54
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >> >> >
> >> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >> >
> >> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> >> >> >> >> > minimum pagelist high at 4 times the batch size, we were able to
> >> >> >> >> > significantly reduce the latency associated with the
> >> >> >> >> > free_pcppages_bulk() function during container exits:
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >> >> >> >> >
> >> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> >> >> >> >> > knob to set the minimum pagelist high at a level that effectively
> >> >> >> >> > mitigated latency issues, we observed that other containers were no
> >> >> >> >> > longer experiencing similar complaints. As a result, we decided to
> >> >> >> >> > implement this tuning as a permanent workaround and have deployed it
> >> >> >> >> > across all clusters of servers where these containers may be deployed.
> >> >> >> >>
> >> >> >> >> Thanks for your detailed data.
> >> >> >> >>
> >> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
> >> >> >> >> shouldn't be a problem?
> >> >> >> >
> >> >> >> > Right. The problem arises when the process holds the lock for too
> >> >> >> > long, causing other processes that are attempting to allocate memory
> >> >> >> > to experience delays or wait times.
> >> >> >> >
> >> >> >> >> Because users care more about the total time of
> >> >> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
> >> >> >> >> contention and page allocating/freeing throughput will be worse with
> >> >> >> >> your configuration?
> >> >> >> >
> >> >> >> > While reducing throughput may not be a significant concern due to the
> >> >> >> > minimal difference, the potential for latency spikes, a crucial metric
> >> >> >> > for assessing system stability, is of greater concern to users. Higher
> >> >> >> > latency can lead to request errors, impacting the user experience.
> >> >> >> > Therefore, maintaining stability, even at the cost of slightly lower
> >> >> >> > throughput, is preferable to experiencing higher throughput with
> >> >> >> > unstable performance.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> >> >> >> processes is a problem.  And your configuration can help it.
> >> >> >> >>
> >> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> >> >> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> >> >> >> >> may help both latency and throughput in your system.  Could you give it
> >> >> >> >> a try?
> >> >> >> >
> >> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> >> >> >> > configuration option. However, I've observed your recent improvements
> >> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> >> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
> >> >> >> > has prompted me to experiment with manually setting the
> >> >> >> > pcp->free_factor to zero. While this adjustment provided some
> >> >> >> > improvement, the results were not as significant as I had hoped.
> >> >> >> >
> >> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
> >> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> >> >> >> > to more easily adjust it.
> >> >> >>
> >> >> >> If you cannot test upstream behavior, it's hard to make changes to
> >> >> >> upstream.  Could you find a way to do that?
> >> >> >
> >> >> > I'm afraid I can't run an upstream kernel in our production environment :(
> >> >> > Lots of code changes have to be made.
> >> >>
> >> >> Understand.  Can you find a way to test upstream behavior, not upstream
> >> >> kernel exactly?  Or test the upstream kernel but in a similar but not
> >> >> exactly production environment.
> >> >
> >> > I'm willing to give it a try, but it may take some time to achieve the
> >> > desired results.
> >>
> >> Thanks!
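
To make the CONFIG_PCP_BATCH_SCALE_MAX idea above concrete: the point is to cap how many pages one CPU frees back to the buddy allocator per lock hold while draining its PCP list. The sketch below only illustrates that shape; the function name and calling context are assumptions for the example, not the upstream implementation:

/*
 * Illustrative sketch only: drain a PCP list in bounded slices so a
 * single CPU cannot hold zone->lock for a very long time.  The slice
 * size is batch << scale_max (scale_max being CONFIG_PCP_BATCH_SCALE_MAX
 * today, default 5, or a vm.pcp_batch_scale_max sysctl if one is added).
 */
static void drain_pcp_in_slices(struct zone *zone, struct per_cpu_pages *pcp,
                                int to_free, int batch, int scale_max)
{
        int slice = batch << scale_max;

        while (to_free > 0) {
                int nr = min(to_free, slice);

                spin_lock(&pcp->lock);
                /* free_pcppages_bulk() takes zone->lock internally. */
                free_pcppages_bulk(zone, nr, pcp, 0);
                spin_unlock(&pcp->lock);

                to_free -= nr;
                if (to_free > 0)
                        cond_resched();
        }
}

With batch = 63, scale_max = 5 allows up to 2016 pages per slice, while scale_max = 0 limits each slice to 63 pages, which is why lowering it shortens the worst-case zone->lock hold time at the cost of more lock acquisitions.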
> >
> > After I backported the series "mm: PCP high auto-tuning," which
> > consists of a total of 9 patches, to our 6.1.y stable kernel and
> > deployed it to our production environment, I observed a significant
> > reduction in latency. The results are as follows:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> That series can reduce the allocation/freeing from/to the buddy system,
> thus reducing the lock contention.
>
> > Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
> > to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
> > vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
> > latency was further reduced to less than 2ms.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 36       |                                        |
> >       2048 -> 4095       : 5063     |*****                                   |
> >       4096 -> 8191       : 31226    |********************************        |
> >       8192 -> 16383      : 37606    |*************************************** |
> >      16384 -> 32767      : 38359    |****************************************|
> >      32768 -> 65535      : 30652    |*******************************         |
> >      65536 -> 131071     : 18714    |*******************                     |
> >     131072 -> 262143     : 7968     |********                                |
> >     262144 -> 524287     : 1996     |**                                      |
> >     524288 -> 1048575    : 302      |                                        |
> >    1048576 -> 2097151    : 19       |                                        |
> >
> > avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
> >
> > After multiple trials, I observed no significant differences between
> > each attempt.
>
> The test results look good.
>
> > Therefore, we decided to backport your improvements to our local
> > kernel. Additionally, I propose introducing a new sysctl knob,
> > vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
> > to easily tune the setting based on their specific workloads.
>
> The downside is that the pcp->high decaying (in decay_pcp_high()) will
> be slower.  That is, it will take longer for idle pages to be freed from
> PCP to buddy.  One possible solution is to keep the decaying page
> number, but use a loop as follows to control latency.
>
> while (count < decay_number) {
>         spin_lock();
>         free_pcppages_bulk(, batch, );
>         spin_unlock();
>         count -= batch;
>         if (count)
>                 cond_resched();
> }

I will try it with this additional change. Thanks for your suggestion.

IIUC, the additional change should be as follows?
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
        int high_min, to_drain, batch;
-       int todo = 0;
+       int todo = 0, count = 0;
 
        high_min = READ_ONCE(pcp->high_min);
        batch = READ_ONCE(pcp->batch);
@@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
         * control latency. This caps pcp->high decrement too.
         */
        if (pcp->high > high_min) {
-               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
+               pcp->high = max3(pcp->count - (batch << 5),
                                 pcp->high - (pcp->high >> 3), high_min);
                if (pcp->high > high_min)
                        todo++;
        }
 
        to_drain = pcp->count - pcp->high;
-       if (to_drain > 0) {
+       while (count < to_drain) {
                spin_lock(&pcp->lock);
-               free_pcppages_bulk(zone, to_drain, pcp, 0);
+               free_pcppages_bulk(zone, batch, pcp, 0);
                spin_unlock(&pcp->lock);
+               count += batch;
                todo++;
+               cond_resched();
        }

-- 
Regards
Yafang
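
If the proposed vm.pcp_batch_scale_max knob moves forward, one plausible wiring is an ordinary "vm" ctl_table entry in place of the Kconfig constant. The entry below is only a sketch under that assumption; the variable name, bounds, and registration point are not taken from any posted patch:

/* Illustrative sketch of a vm.pcp_batch_scale_max sysctl; not upstream code. */
static int pcp_batch_scale_max = 5;       /* mirrors the current Kconfig default */
static int pcp_batch_scale_max_max = 6;   /* assumed upper bound */

static struct ctl_table pcp_batch_scale_sysctl[] = {
        {
                .procname       = "pcp_batch_scale_max",
                .data           = &pcp_batch_scale_max,
                .maxlen         = sizeof(pcp_batch_scale_max),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = &pcp_batch_scale_max_max,
        },
        { }
};

/* Registered once at init time, e.g. register_sysctl("vm", pcp_batch_scale_sysctl); */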