From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Yafang Shao
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Fri, 5 Jul 2024 11:03:37 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com>
 <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
 <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 05 Jul 2024 13:31:06 +0800
Message-ID: <87cynse76t.fsf@yhuang6-desk2.ccr.corp.intel.com>

Yafang Shao writes:

> On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
>> >> >> >>
>> >> >> >> Yafang Shao writes:
>> >> >> >>
>> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>> >> >> >> >>
>> >> >> >> >> Yafang Shao writes:
>> >> >> >> >>
>> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment when a specific container with multiple Python-based tasks exits. These tasks may hold the zone->lock for an extended period, significantly impacting latency for other containers attempting to allocate memory.
>> >> >> >> >> >>
>> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A reasonably detailed description of the issue and a description of any ongoing work would be helpful here.
>> >> >> >> >> >
>> >> >> >> >> > In our containerized environment, we have a specific type of container that runs 18 processes, each consuming approximately 6GB of RSS. These processes are organized as separate processes rather than threads, because the Python Global Interpreter Lock (GIL) is a bottleneck in a multi-threaded setup. Upon the exit of these containers, other containers hosted on the same machine experience significant latency spikes.
>> >> >> >> >> >
>> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of these spikes is the simultaneous execution of exit_mmap() by each of the exiting processes. This concurrent access to the zone->lock results in contention, which becomes a hotspot and negatively impacts performance. The perf results clearly indicate this contention as a primary contributor to the observed latency issues.
>> >> >> >> >> >
>> >> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >> >> >> >    - 76.97% exit_mmap
>> >> >> >> >> >       - 58.58% unmap_vmas
>> >> >> >> >> >          - 58.55% unmap_single_vma
>> >> >> >> >> >             - unmap_page_range
>> >> >> >> >> >                - 58.32% zap_pte_range
>> >> >> >> >> >                   - 42.88% tlb_flush_mmu
>> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
>> >> >> >> >> >                         - 41.22% release_pages
>> >> >> >> >> >                            - 33.29% free_unref_page_list
>> >> >> >> >> >                               - 32.37% free_unref_page_commit
>> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
>> >> >> >> >> >                                     + 28.65% _raw_spin_lock
>> >> >> >> >> >                                       1.28% __list_del_entry_valid
>> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
>> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
>> >> >> >> >> >                              0.60% __mod_lruvec_state
>> >> >> >> >> >                           1.07% free_swap_cache
>> >> >> >> >> >                   + 11.69% page_remove_rmap
>> >> >> >> >> >                     0.64% __mod_lruvec_page_state
>> >> >> >> >> >       - 17.34% remove_vma
>> >> >> >> >> >          - 17.25% vm_area_free
>> >> >> >> >> >             - 17.23% kmem_cache_free
>> >> >> >> >> >                - 17.15% __slab_free
>> >> >> >> >> >                   - 14.56% discard_slab
>> >> >> >> >> >                        free_slab
>> >> >> >> >> >                        __free_slab
>> >> >> >> >> >                        __free_pages
>> >> >> >> >> >                      - free_unref_page
>> >> >> >> >> >                         - 13.50% free_unref_page_commit
>> >> >> >> >> >                            - free_pcppages_bulk
>> >> >> >> >> >                               + 13.44% _raw_spin_lock
>> >> >> >> >> >
>> >> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can find the detailed stack:
>> >> >> >> >> >
>> >> >> >> >> >           <...>-1540432 [224] d..3.  618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >> >> >> >           <...>-1540432 [224] d..3.  618048.023887:
>> >> >> >> >> >  => free_pcppages_bulk
>> >> >> >> >> >  => free_unref_page_commit
>> >> >> >> >> >  => free_unref_page_list
>> >> >> >> >> >  => release_pages
>> >> >> >> >> >  => free_pages_and_swap_cache
>> >> >> >> >> >  => tlb_flush_mmu
>> >> >> >> >> >  => zap_pte_range
>> >> >> >> >> >  => unmap_page_range
>> >> >> >> >> >  => unmap_single_vma
>> >> >> >> >> >  => unmap_vmas
>> >> >> >> >> >  => exit_mmap
>> >> >> >> >> >  => mmput
>> >> >> >> >> >  => do_exit
>> >> >> >> >> >  => do_group_exit
>> >> >> >> >> >  => get_signal
>> >> >> >> >> >  => arch_do_signal_or_restart
>> >> >> >> >> >  => exit_to_user_mode_prepare
>> >> >> >> >> >  => syscall_exit_to_user_mode
>> >> >> >> >> >  => do_syscall_64
>> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >> >> >
>> >> >> >> >> > The servers experiencing these issues are equipped with impressive hardware specifications, including 256 CPUs and 1TB of memory, all within a single NUMA node. The zoneinfo is as follows:
>> >> >> >> >> >
>> >> >> >> >> > Node 0, zone   Normal
>> >> >> >> >> >   pages free     144465775
>> >> >> >> >> >         boost    0
>> >> >> >> >> >         min      1309270
>> >> >> >> >> >         low      1636587
>> >> >> >> >> >         high     1963904
>> >> >> >> >> >         spanned  564133888
>> >> >> >> >> >         present  296747008
>> >> >> >> >> >         managed  291974346
>> >> >> >> >> >         cma      0
>> >> >> >> >> >         protection: (0, 0, 0, 0)
>> >> >> >> >> > ...
>> >> >> >> >> > ...
>> >> >> >> >> >   pagesets
>> >> >> >> >> >     cpu: 0
>> >> >> >> >> >               count: 2217
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 1
>> >> >> >> >> >               count: 4510
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >   vm stats threshold: 125
>> >> >> >> >> >     cpu: 2
>> >> >> >> >> >               count: 3059
>> >> >> >> >> >               high:  6392
>> >> >> >> >> >               batch: 63
>> >> >> >> >> >
>> >> >> >> >> > ...
>> >> >> >> >> >
>> >> >> >> >> > The high (6392) is around 100 times the batch size (63).
>> >> >> >> >> >
>> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk() function during the container exit process:
>> >> >> >> >> >
>> >> >> >> >> > 19:48:54
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >> >> >    2097152 -> 4194303    : 327      |***************************************|
>> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >> >> >
>> >> >> >> >> > The latency can reach tens of milliseconds.
>> >> >> >> >> >
>> >> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the minimum pagelist high at 4 times the batch size, we were able to significantly reduce the latency associated with the free_pcppages_bulk() function during container exits:
>> >> >> >> >> >
>> >> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >> >> >
>> >> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >> >> >
>> >> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl knob to set the minimum pagelist high at a level that effectively mitigated the latency issues, we observed that other containers were no longer experiencing similar complaints. As a result, we decided to implement this tuning as a permanent workaround and have deployed it across all clusters of servers where these containers may be deployed.
>> >> >> >> >>
>> >> >> >> >> Thanks for your detailed data.
>> >> >> >> >>
>> >> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting shouldn't be a problem?
>> >> >> >> >
>> >> >> >> > Right. The problem arises when the process holds the lock for too long, causing other processes that are attempting to allocate memory to experience delays or wait times.
>> >> >> >> >
>> >> >> >> >> Because users care more about the total time of process exiting, that is, throughput. And I suspect that the zone->lock contention and page allocating/freeing throughput will be worse with your configuration?
>> >> >> >> >
>> >> >> >> > While reducing throughput may not be a significant concern due to the minimal difference, the potential for latency spikes, a crucial metric for assessing system stability, is of greater concern to users. Higher latency can lead to request errors, impacting the user experience. Therefore, maintaining stability, even at the cost of slightly lower throughput, is preferable to experiencing higher throughput with unstable performance.
>> >> >> >> >
>> >> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other processes is a problem. And your configuration can help it.
>> >> >> >> >>
>> >> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. In that way, you have a normal PCP size (high) but a smaller PCP batch. I guess that may help both latency and throughput in your system. Could you give it a try?
>> >> >> >> >
>> >> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX configuration option. However, I've observed your recent improvements to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"), which has prompted me to experiment with manually setting pcp->free_factor to zero. While this adjustment provided some improvement, the results were not as significant as I had hoped.
>> >> >> >> >
>> >> >> >> > BTW, perhaps we should consider implementing a sysctl knob as an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users to adjust it more easily.
>> >> >> >>
>> >> >> >> If you cannot test upstream behavior, it's hard to make changes to upstream. Could you find a way to do that?
>> >> >> >
>> >> >> > I'm afraid I can't run an upstream kernel in our production environment :( Lots of code changes would have to be made.
>> >> >>
>> >> >> Understood. Can you find a way to test the upstream behavior, if not the upstream kernel exactly? Or test the upstream kernel in a similar, though not exactly production, environment?
>> >> >
>> >> > I'm willing to give it a try, but it may take some time to achieve the desired results.
>> >>
>> >> Thanks!
>> >
>> > After I backported the series "mm: PCP high auto-tuning," which consists of a total of 9 patches, to our 6.1.y stable kernel and deployed it to our production environment, I observed a significant reduction in latency. The results are as follows:
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 2        |                                        |
>> >       2048 -> 4095       : 11       |                                        |
>> >       4096 -> 8191       : 3        |                                        |
>> >       8192 -> 16383      : 1        |                                        |
>> >      16384 -> 32767      : 2        |                                        |
>> >      32768 -> 65535      : 7        |                                        |
>> >      65536 -> 131071     : 198      |*********                               |
>> >     131072 -> 262143     : 530      |************************                |
>> >     262144 -> 524287     : 824      |**************************************  |
>> >     524288 -> 1048575    : 852      |****************************************|
>> >    1048576 -> 2097151    : 714      |*********************************       |
>> >    2097152 -> 4194303    : 389      |******************                      |
>> >    4194304 -> 8388607    : 143      |******                                  |
>> >    8388608 -> 16777215   : 29       |*                                       |
>> >   16777216 -> 33554431   : 1        |                                        |
>> >
>> > avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
>> >
>> > Compared to the previous data, the maximum latency has been reduced to less than 30ms.
>>
>> That series can reduce the allocation/freeing from/to the buddy system, thus reducing the lock contention.
>>
>> > Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max, to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum latency was further reduced to less than 2ms.
>> >
>> >      nsecs               : count     distribution
>> >          0 -> 1          : 0        |                                        |
>> >          2 -> 3          : 0        |                                        |
>> >          4 -> 7          : 0        |                                        |
>> >          8 -> 15         : 0        |                                        |
>> >         16 -> 31         : 0        |                                        |
>> >         32 -> 63         : 0        |                                        |
>> >         64 -> 127        : 0        |                                        |
>> >        128 -> 255        : 0        |                                        |
>> >        256 -> 511        : 0        |                                        |
>> >        512 -> 1023       : 0        |                                        |
>> >       1024 -> 2047       : 36       |                                        |
>> >       2048 -> 4095       : 5063     |*****                                   |
>> >       4096 -> 8191       : 31226    |********************************        |
>> >       8192 -> 16383      : 37606    |***************************************.|
>> >      16384 -> 32767      : 38359    |****************************************|
>> >      32768 -> 65535      : 30652    |*******************************         |
>> >      65536 -> 131071     : 18714    |*******************                     |
>> >     131072 -> 262143     : 7968     |********                                |
>> >     262144 -> 524287     : 1996     |**                                      |
>> >     524288 -> 1048575    : 302      |                                        |
>> >    1048576 -> 2097151    : 19       |                                        |
>> >
>> > avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
>> >
>> > After multiple trials, I observed no significant differences between each attempt.
>>
>> The test results look good.
>>
>> > Therefore, we decided to backport your improvements to our local kernel. Additionally, I propose introducing a new sysctl knob, vm.pcp_batch_scale_max, to the upstream kernel. This will enable users to easily tune the setting based on their specific workloads.
>>
>> The downside is that the pcp->high decaying (in decay_pcp_high()) will be slower. That is, it will take longer for idle pages to be freed from the PCP to the buddy system. One possible solution is to keep the decaying page number, but use a loop as follows to control latency:
>>
>>         while (count < decay_number) {
>>                 spin_lock();
>>                 free_pcppages_bulk(, batch, );
>>                 spin_unlock();
>>                 count -= batch;
>>                 if (count)
>>                         cond_resched();
>>         }
>
> I will try it with this additional change.
> Thanks for your suggestion.
>
> IIUC, the additional change should be as follows?
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>  {
>         int high_min, to_drain, batch;
> -       int todo = 0;
> +       int todo = 0, count = 0;
>
>         high_min = READ_ONCE(pcp->high_min);
>         batch = READ_ONCE(pcp->batch);
> @@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>          * control latency. This caps pcp->high decrement too.
>          */
>         if (pcp->high > high_min) {
> -               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
> +               pcp->high = max3(pcp->count - (batch << 5),

Please avoid using magic numbers if possible.  Otherwise, it looks good to me.  Thanks!

>                                  pcp->high - (pcp->high >> 3), high_min);
>                 if (pcp->high > high_min)
>                         todo++;
>         }
>
>         to_drain = pcp->count - pcp->high;
> -       if (to_drain > 0) {
> +       while (count < to_drain) {
>                 spin_lock(&pcp->lock);
> -               free_pcppages_bulk(zone, to_drain, pcp, 0);
> +               free_pcppages_bulk(zone, batch, pcp, 0);
>                 spin_unlock(&pcp->lock);
> +               count += batch;
>                 todo++;
> +               cond_resched();
>         }

--
Best Regards,
Huang, Ying
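
The latency-capping idea the thread converges on (drain the PCP backlog in pcp->batch-sized chunks, dropping the lock and rescheduling between chunks, instead of freeing everything in one long critical section) can be illustrated outside the kernel. Below is a minimal, self-contained userspace C sketch; the names (fake_zone, drain_in_batches, BATCH) and the pthread mutex standing in for zone->lock/pcp->lock are invented for the example and are not the kernel implementation.

/*
 * Userspace sketch (not kernel code) of the batched-drain idea:
 * free a large backlog in fixed-size batches, releasing the lock and
 * yielding between batches so concurrent allocators see a bounded
 * lock hold time.  All identifiers here are illustrative only.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define BATCH 63                        /* analogous to pcp->batch */

struct fake_zone {
        pthread_mutex_t lock;           /* stand-in for zone->lock */
        long freed;                     /* pages returned to the "buddy" */
};

/* Work done for one page while the lock is held. */
static void free_one_page(struct fake_zone *z)
{
        z->freed++;
}

/*
 * Drain 'count' pages in batches of BATCH.  Each batch is one short
 * critical section; between batches the lock is dropped and the CPU is
 * yielded, which mirrors the while (count < to_drain) loop in the diff
 * above (sched_yield() stands in for cond_resched()).
 */
static void drain_in_batches(struct fake_zone *z, long count)
{
        long done = 0;

        while (done < count) {
                long n = (count - done > BATCH) ? BATCH : count - done;

                pthread_mutex_lock(&z->lock);
                for (long i = 0; i < n; i++)
                        free_one_page(z);
                pthread_mutex_unlock(&z->lock);

                done += n;
                if (done < count)
                        sched_yield();
        }
}

int main(void)
{
        struct fake_zone z = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .freed = 0,
        };

        /* Drain a backlog of 6392 pages, the "high" value from the zoneinfo dump. */
        drain_in_batches(&z, 6392);
        printf("freed %ld pages in batches of %d\n", z.freed, BATCH);
        return 0;
}

The trade-off Ying points out still applies to this scheme: more, shorter critical sections mean the drain as a whole takes longer, so idle pages linger on the PCP a bit longer before reaching the buddy allocator.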