From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 3 Jul 2024 11:44:08 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: "Huang, Ying"
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
In-Reply-To: <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240701142046.6050-1-laoar.shao@gmail.com>
 <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
 <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
> >> >> >>
> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
> >> >> >>
> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >>
> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> ongoing work would be helpful here.
> >> >> >
> >> >> > In our containerized environment, we have a specific type of container
> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> > processes are organized as separate processes rather than threads due
> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> > containers hosted on the same machine experience significant latency
> >> >> > spikes.
> >> >> >
> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> > primary contributor to the observed latency issues.
> >> >> >
> >> >> >   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >      - 76.97% exit_mmap
> >> >> >         - 58.58% unmap_vmas
> >> >> >            - 58.55% unmap_single_vma
> >> >> >               - unmap_page_range
> >> >> >                  - 58.32% zap_pte_range
> >> >> >                     - 42.88% tlb_flush_mmu
> >> >> >                        - 42.76% free_pages_and_swap_cache
> >> >> >                           - 41.22% release_pages
> >> >> >                              - 33.29% free_unref_page_list
> >> >> >                                 - 32.37% free_unref_page_commit
> >> >> >                                    - 31.64% free_pcppages_bulk
> >> >> >                                       + 28.65% _raw_spin_lock
> >> >> >                                         1.28% __list_del_entry_valid
> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
> >> >> >                                0.60% __mod_lruvec_state
> >> >> >                           1.07% free_swap_cache
> >> >> >                     + 11.69% page_remove_rmap
> >> >> >                       0.64% __mod_lruvec_page_state
> >> >> >         - 17.34% remove_vma
> >> >> >            - 17.25% vm_area_free
> >> >> >               - 17.23% kmem_cache_free
> >> >> >                  - 17.15% __slab_free
> >> >> >                     - 14.56% discard_slab
> >> >> >                          free_slab
> >> >> >                          __free_slab
> >> >> >                          __free_pages
> >> >> >                        - free_unref_page
> >> >> >                           - 13.50% free_unref_page_commit
> >> >> >                              - free_pcppages_bulk
> >> >> >                                 + 13.44% _raw_spin_lock
> >> >> >
> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can find the detailed stack:
> >> >> >
> >> >> >      <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >      <...>-1540432 [224] d..3. 618048.023887:
> >> >> >  => free_pcppages_bulk
> >> >> >  => free_unref_page_commit
> >> >> >  => free_unref_page_list
> >> >> >  => release_pages
> >> >> >  => free_pages_and_swap_cache
> >> >> >  => tlb_flush_mmu
> >> >> >  => zap_pte_range
> >> >> >  => unmap_page_range
> >> >> >  => unmap_single_vma
> >> >> >  => unmap_vmas
> >> >> >  => exit_mmap
> >> >> >  => mmput
> >> >> >  => do_exit
> >> >> >  => do_group_exit
> >> >> >  => get_signal
> >> >> >  => arch_do_signal_or_restart
> >> >> >  => exit_to_user_mode_prepare
> >> >> >  => syscall_exit_to_user_mode
> >> >> >  => do_syscall_64
> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >
> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >
> >> >> > Node 0, zone   Normal
> >> >> >   pages free     144465775
> >> >> >         boost    0
> >> >> >         min      1309270
> >> >> >         low      1636587
> >> >> >         high     1963904
> >> >> >         spanned  564133888
> >> >> >         present  296747008
> >> >> >         managed  291974346
> >> >> >         cma      0
> >> >> >         protection: (0, 0, 0, 0)
> >> >> > ...
> >> >> > ...
> >> >> > pagesets
> >> >> >     cpu: 0
> >> >> >               count: 2217
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 1
> >> >> >               count: 4510
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 2
> >> >> >               count: 3059
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >
> >> >> > ...
> >> >> >
> >> >> > The high is around 100 times the batch size.
> >> >> >
> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> > function during the container exit process:
> >> >> >
> >> >> > 19:48:54
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >
> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >
> >> >> > The latency can reach tens of milliseconds.
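
To make the "high: 6392" above easier to interpret, here is a rough,
userspace-only sketch of the arithmetic. The helper name and the formula are
only an approximation of the pcp sizing logic in mm/page_alloc.c (by default
the zone low watermark is split across the CPUs local to the zone), with the
4*batch floor taken from the minimum referred to below; this is not the
kernel code itself:

    #include <stdio.h>

    /* Approximate the per-CPU pagelist "high" limit for one zone. */
    static long pcp_high_estimate(long low_wmark_pages, long managed_pages,
                                  long nr_cpus, long batch, long high_fraction)
    {
            long total, high;

            if (high_fraction == 0)
                    total = low_wmark_pages;                /* default: based on the low watermark */
            else
                    total = managed_pages / high_fraction;  /* sysctl: a fraction of managed pages */

            high = total / nr_cpus;                         /* split across the zone's CPUs */
            if (high < 4 * batch)                           /* floor: 4 times the batch size */
                    high = 4 * batch;
            return high;
    }

    int main(void)
    {
            /* Numbers taken from the zoneinfo and pagesets shown above. */
            long low = 1636587, managed = 291974346, cpus = 256, batch = 63;

            printf("default high ~= %ld\n", pcp_high_estimate(low, managed, cpus, batch, 0));
            /* A large fraction pushes the estimate below the floor, so high becomes 4*batch. */
            printf("tuned high   ~= %ld\n", pcp_high_estimate(low, managed, cpus, batch, 100000));
            return 0;
    }

With these inputs the default estimate comes out at 6392 pages per CPU, which
matches the pagesets dump above, while a sufficiently large
vm.percpu_pagelist_high_fraction drives it down to the 4*batch (252 page)
minimum discussed below.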
> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> >> >> > minimum pagelist high at 4 times the batch size, we were able to
> >> >> > significantly reduce the latency associated with the
> >> >> > free_pcppages_bulk() function during container exits:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >
> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >> >> >
> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> >> >> > knob to set the minimum pagelist high at a level that effectively
> >> >> > mitigated latency issues, we observed that other containers were no
> >> >> > longer experiencing similar complaints. As a result, we decided to
> >> >> > implement this tuning as a permanent workaround and have deployed it
> >> >> > across all clusters of servers where these containers may be deployed.
> >> >>
> >> >> Thanks for your detailed data.
> >> >>
> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
> >> >> shouldn't be a problem?
> >> >
> >> > Right. The problem arises when the process holds the lock for too
> >> > long, causing other processes that are attempting to allocate memory
> >> > to experience delays or wait times.
> >> >
> >> >> Because users care more about the total time of
> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
> >> >> contention and page allocating/freeing throughput will be worse with
> >> >> your configuration?
> >> >
> >> > While reducing throughput may not be a significant concern due to the
> >> > minimal difference, the potential for latency spikes, a crucial metric
> >> > for assessing system stability, is of greater concern to users. Higher
> >> > latency can lead to request errors, impacting the user experience.
> >> > Therefore, maintaining stability, even at the cost of slightly lower
> >> > throughput, is preferable to experiencing higher throughput with
> >> > unstable performance.
> >> >
> >> >>
> >> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> >> processes is a problem.  And your configuration can help it.
> >> >>
> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> >> >> may help both latency and throughput in your system.  Could you give it
> >> >> a try?
> >> >
> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> >> > configuration option.
> >> > However, I've observed your recent improvements
> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> >> > restrict the pcp batch scale factor to avoid too long latency"), which
> >> > has prompted me to experiment with manually setting the
> >> > pcp->free_factor to zero. While this adjustment provided some
> >> > improvement, the results were not as significant as I had hoped.
> >> >
> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> >> > to more easily adjust it.
> >>
> >> If you cannot test upstream behavior, it's hard to make changes to
> >> upstream.  Could you find a way to do that?
> >
> > I'm afraid I can't run an upstream kernel in our production environment :(
> > Lots of code changes have to be made.
>
> Understand.  Can you find a way to test upstream behavior, not upstream
> kernel exactly?  Or test the upstream kernel but in a similar but not
> exactly production environment.

I'm willing to give it a try, but it may take some time to achieve the
desired results.

>
> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
> >
> > It seems incorrect.
> > Look at the code in free_unref_page_commit():
> >
> >     if (pcp->count >= high) {
> >             free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> >                                pcp, pindex);
> >     }
> >
> > And nr_pcp_free():
> >
> >     min_nr_free = batch;
> >     max_nr_free = high - batch;
> >
> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
> >     return batch;
> >
> > The 'batch' is not a fixed value but changes dynamically, doesn't it?
>
> Sorry, my words were confusing.  For 'batch', I mean the value of the
> "count" parameter of free_pcppages_bulk() actually.  For example, if we
> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.

If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
that pcp->free_count should not exceed (63 << 0), right? (suppose 63 is
the default batch size)

However, at worst, pcp->free_count can reach (62 + (1 << MAX_ORDER)); is
that expected?

Perhaps we should make the change below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7313f9d704b..8c52a30201d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
 		pcp->free_count += (1 << order);
+		if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
+			pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),

--
Regards
Yafang
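
To spell out the overshoot arithmetic from the question above, here is a tiny
userspace illustration. The constants are assumptions for the example (batch
size 63, CONFIG_PCP_BATCH_SCALE_MAX of 0, and order 10 as the largest order
being freed); it only models the guard-then-add pattern shown in the diff, not
the kernel itself:

    #include <stdio.h>

    #define BATCH           63      /* assumed default pcp->batch */
    #define SCALE_MAX       0       /* assumed CONFIG_PCP_BATCH_SCALE_MAX */
    #define MAX_PAGE_ORDER  10      /* assumed largest order being freed */

    int main(void)
    {
            int cap = BATCH << SCALE_MAX;   /* intended ceiling: 63 */
            int free_count = cap - 1;       /* just under the guard, so the add still runs */

            /* Mirrors the guard-then-add in free_unref_page_commit() above. */
            if (free_count < cap)
                    free_count += 1 << MAX_PAGE_ORDER;

            printf("intended cap: %d, worst-case free_count: %d\n", cap, free_count);
            return 0;
    }

This prints an intended cap of 63 but a worst-case free_count of 1086, i.e.
62 + (1 << 10), which is the gap the proposed clamp is meant to close.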