From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Wed, 3 Jul 2024 11:44:08 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com>
	<20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
	<874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 03 Jul 2024 13:34:35 +0800
Message-ID: <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Yafang Shao writes:

> On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>>
>> Yafang Shao writes:
>>
>> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
>> >>
>> >> Yafang Shao writes:
>> >>
>> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yafang Shao writes:
>> >> >>
>> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>> >> >> >>
>> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>> >> >> >>
>> >> >> >> > Currently, we're encountering latency spikes in our container environment
>> >> >> >> > when a specific container with multiple Python-based tasks exits. These
>> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
>> >> >> >> > impacting latency for other containers attempting to allocate memory.
>> >> >> >>
>> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
>> >> >> >> reasonably detailed description of the issue and a description of any
>> >> >> >> ongoing work would be helpful here.
>> >> >> >
>> >> >> > In our containerized environment, we have a specific type of container
>> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
>> >> >> > processes are organized as separate processes rather than threads due
>> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
>> >> >> > multi-threaded setup. Upon the exit of these containers, other
>> >> >> > containers hosted on the same machine experience significant latency
>> >> >> > spikes.
>> >> >> >
>> >> >> > Our investigation using perf tracing revealed that the root cause of
>> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
>> >> >> > the exiting processes. This concurrent access to the zone->lock
>> >> >> > results in contention, which becomes a hotspot and negatively impacts
>> >> >> > performance. The perf results clearly indicate this contention as a
>> >> >> > primary contributor to the observed latency issues.
>> >> >> >
>> >> >> >   +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>> >> >> >   -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>> >> >> >      - 76.97% exit_mmap
>> >> >> >         - 58.58% unmap_vmas
>> >> >> >            - 58.55% unmap_single_vma
>> >> >> >               - unmap_page_range
>> >> >> >                  - 58.32% zap_pte_range
>> >> >> >                     - 42.88% tlb_flush_mmu
>> >> >> >                        - 42.76% free_pages_and_swap_cache
>> >> >> >                           - 41.22% release_pages
>> >> >> >                              - 33.29% free_unref_page_list
>> >> >> >                                 - 32.37% free_unref_page_commit
>> >> >> >                                    - 31.64% free_pcppages_bulk
>> >> >> >                                       + 28.65% _raw_spin_lock
>> >> >> >                                         1.28% __list_del_entry_valid
>> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
>> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
>> >> >> >                                0.60% __mod_lruvec_state
>> >> >> >                             1.07% free_swap_cache
>> >> >> >                  + 11.69% page_remove_rmap
>> >> >> >                    0.64% __mod_lruvec_page_state
>> >> >> >      - 17.34% remove_vma
>> >> >> >         - 17.25% vm_area_free
>> >> >> >            - 17.23% kmem_cache_free
>> >> >> >               - 17.15% __slab_free
>> >> >> >                  - 14.56% discard_slab
>> >> >> >                       free_slab
>> >> >> >                          __free_slab
>> >> >> >                             __free_pages
>> >> >> >                                - free_unref_page
>> >> >> >                                   - 13.50% free_unref_page_commit
>> >> >> >                                      - free_pcppages_bulk
>> >> >> >                                         + 13.44% _raw_spin_lock
>> >> >> >
>> >> >> > By enabling the mm_page_pcpu_drain() we can find the detailed stack:
>> >> >> >
>> >> >> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>> >> >> >       page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
>> >> >> >   <...>-1540432 [224] d..3. 618048.023887:
>> >> >> >  => free_pcppages_bulk
>> >> >> >  => free_unref_page_commit
>> >> >> >  => free_unref_page_list
>> >> >> >  => release_pages
>> >> >> >  => free_pages_and_swap_cache
>> >> >> >  => tlb_flush_mmu
>> >> >> >  => zap_pte_range
>> >> >> >  => unmap_page_range
>> >> >> >  => unmap_single_vma
>> >> >> >  => unmap_vmas
>> >> >> >  => exit_mmap
>> >> >> >  => mmput
>> >> >> >  => do_exit
>> >> >> >  => do_group_exit
>> >> >> >  => get_signal
>> >> >> >  => arch_do_signal_or_restart
>> >> >> >  => exit_to_user_mode_prepare
>> >> >> >  => syscall_exit_to_user_mode
>> >> >> >  => do_syscall_64
>> >> >> >  => entry_SYSCALL_64_after_hwframe
>> >> >> >
>> >> >> > The servers experiencing these issues are equipped with impressive
>> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
>> >> >> > within a single NUMA node. The zoneinfo is as follows,
>> >> >> >
>> >> >> > Node 0, zone   Normal
>> >> >> >   pages free     144465775
>> >> >> >         boost    0
>> >> >> >         min      1309270
>> >> >> >         low      1636587
>> >> >> >         high     1963904
>> >> >> >         spanned  564133888
>> >> >> >         present  296747008
>> >> >> >         managed  291974346
>> >> >> >         cma      0
>> >> >> >         protection: (0, 0, 0, 0)
>> >> >> > ...
>> >> >> > ...
>> >> >> >   pagesets
>> >> >> >     cpu: 0
>> >> >> >               count: 2217
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 1
>> >> >> >               count: 4510
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >   vm stats threshold: 125
>> >> >> >     cpu: 2
>> >> >> >               count: 3059
>> >> >> >               high:  6392
>> >> >> >               batch: 63
>> >> >> >
>> >> >> > ...
>> >> >> >
>> >> >> > The high is around 100 times the batch size.
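As a cross-check, the "high: 6392" in the zoneinfo above follows directly from the reported numbers. The Python below is a simplified sketch of the pcp-high sizing logic (modeled on zone_highsize() in mm/page_alloc.c of recent kernels, not the kernel code itself); the 4*batch floor is the minimum that the patch under discussion wants to reach by setting the sysctl to -1.

```python
def pcp_high(low_wmark, managed, nr_cpus, batch, high_fraction=0):
    """Sketch of how the kernel sizes the per-CPU pagelist 'high' limit.

    By default the zone's low watermark worth of pages is spread across
    the CPUs; a nonzero vm.percpu_pagelist_high_fraction divides the
    zone's managed pages instead.
    """
    if high_fraction:
        total = managed // high_fraction
    else:
        total = low_wmark
    # high never drops below 4x the batch size.
    return max(total // nr_cpus, batch * 4)

# Numbers from the zoneinfo above: low 1636587, managed 291974346,
# 256 CPUs, batch 63.
print(pcp_high(1636587, 291974346, 256, 63))         # -> 6392
# A very large fraction collapses high to the 4*batch floor:
print(pcp_high(1636587, 291974346, 256, 63, 10**9))  # -> 252
```

So 1636587 low-watermark pages spread over 256 CPUs gives the observed high of 6392, roughly 100x the batch of 63, while the floor is 252.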
>> >> >> >
>> >> >> > We also traced the latency associated with the free_pcppages_bulk()
>> >> >> > function during the container exit process:
>> >> >> >
>> >> >> > 19:48:54
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >        256 -> 511        : 148      |*****************                       |
>> >> >> >        512 -> 1023       : 334      |****************************************|
>> >> >> >       1024 -> 2047       : 33       |***                                     |
>> >> >> >       2048 -> 4095       : 5        |                                        |
>> >> >> >       4096 -> 8191       : 7        |                                        |
>> >> >> >       8192 -> 16383      : 12       |*                                       |
>> >> >> >      16384 -> 32767      : 30       |***                                     |
>> >> >> >      32768 -> 65535      : 21       |**                                      |
>> >> >> >      65536 -> 131071     : 15       |*                                       |
>> >> >> >     131072 -> 262143     : 27       |***                                     |
>> >> >> >     262144 -> 524287     : 84       |**********                              |
>> >> >> >     524288 -> 1048575    : 203      |************************                |
>> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
>> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
>> >> >> >    4194304 -> 8388607    : 215      |*************************               |
>> >> >> >    8388608 -> 16777215   : 116      |*************                           |
>> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
>> >> >> >   33554432 -> 67108863   : 8        |                                        |
>> >> >> >   67108864 -> 134217727  : 3        |                                        |
>> >> >> >
>> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>> >> >> >
>> >> >> > The latency can reach tens of milliseconds.
>> >> >> >
>> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
>> >> >> > minimum pagelist high at 4 times the batch size, we were able to
>> >> >> > significantly reduce the latency associated with the
>> >> >> > free_pcppages_bulk() function during container exits:
>> >> >> >
>> >> >> >      nsecs               : count     distribution
>> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >        128 -> 255        : 120      |                                        |
>> >> >> >        256 -> 511        : 365      |*                                       |
>> >> >> >        512 -> 1023       : 201      |                                        |
>> >> >> >       1024 -> 2047       : 103      |                                        |
>> >> >> >       2048 -> 4095       : 84       |                                        |
>> >> >> >       4096 -> 8191       : 87       |                                        |
>> >> >> >       8192 -> 16383      : 4777     |**************                          |
>> >> >> >      16384 -> 32767      : 10572    |*******************************         |
>> >> >> >      32768 -> 65535      : 13544    |****************************************|
>> >> >> >      65536 -> 131071     : 12723    |*************************************   |
>> >> >> >     131072 -> 262143     : 8604     |*************************               |
>> >> >> >     262144 -> 524287     : 3659     |**********                              |
>> >> >> >     524288 -> 1048575    : 921      |**                                      |
>> >> >> >    1048576 -> 2097151    : 122      |                                        |
>> >> >> >    2097152 -> 4194303    : 5        |                                        |
>> >> >> >
>> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>> >> >> >
>> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
>> >> >> > knob to set the minimum pagelist high at a level that effectively
>> >> >> > mitigated latency issues, we observed that other containers were no
>> >> >> > longer experiencing similar complaints. As a result, we decided to
>> >> >> > implement this tuning as a permanent workaround and have deployed it
>> >> >> > across all clusters of servers where these containers may be deployed.
>> >> >>
>> >> >> Thanks for your detailed data.
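The two histograms above are worth comparing with a little arithmetic on the reported totals (all numbers taken directly from the traces):

```python
# Before tuning: 1920 large bulk frees; after: 55925 small ones.
before_total, before_count = 5887317501, 1920
after_total, after_count = 5805802787, 55925

avg_before = before_total // before_count  # 3066311 ns, as reported
avg_after = after_total // after_count     # 103814 ns, as reported

print(avg_before, avg_after)
print(avg_before // avg_after)  # roughly 30x lower per-call latency

# Total time spent in free_pcppages_bulk() is nearly unchanged
# (~5.9 s vs ~5.8 s), so the tuning shortens each zone->lock hold
# rather than reducing the total freeing work.
```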
>> >> >>
>> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
>> >> >> shouldn't be a problem?
>> >> >
>> >> > Right. The problem arises when the process holds the lock for too
>> >> > long, causing other processes that are attempting to allocate memory
>> >> > to experience delays or wait times.
>> >> >
>> >> >> Because users care more about the total time of
>> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
>> >> >> contention and page allocating/freeing throughput will be worse with
>> >> >> your configuration?
>> >> >
>> >> > While reducing throughput may not be a significant concern due to the
>> >> > minimal difference, the potential for latency spikes, a crucial metric
>> >> > for assessing system stability, is of greater concern to users. Higher
>> >> > latency can lead to request errors, impacting the user experience.
>> >> > Therefore, maintaining stability, even at the cost of slightly lower
>> >> > throughput, is preferable to experiencing higher throughput with
>> >> > unstable performance.
>> >> >
>> >> >> But the latency of free_pcppages_bulk() and page allocation in other
>> >> >> processes is a problem.  And your configuration can help it.
>> >> >>
>> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
>> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
>> >> >> may help both latency and throughput in your system.  Could you give it
>> >> >> a try?
>> >> >
>> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
>> >> > configuration option. However, I've observed your recent improvements
>> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
>> >> > restrict the pcp batch scale factor to avoid too long latency"), which
>> >> > has prompted me to experiment with manually setting the
>> >> > pcp->free_factor to zero.
>> >> > While this adjustment provided some
>> >> > improvement, the results were not as significant as I had hoped.
>> >> >
>> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
>> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
>> >> > to more easily adjust it.
>> >>
>> >> If you cannot test upstream behavior, it's hard to make changes to
>> >> upstream.  Could you find a way to do that?
>> >
>> > I'm afraid I can't run an upstream kernel in our production environment :(
>> > Lots of code changes have to be made.
>>
>> Understand.  Can you find a way to test upstream behavior, not the upstream
>> kernel exactly?  Or test the upstream kernel in a similar but not
>> exactly production environment.
>
> I'm willing to give it a try, but it may take some time to achieve the
> desired results.

Thanks!

>> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
>> >
>> > It seems incorrect.
>> > Look at the code in free_unref_page_commit():
>> >
>> >     if (pcp->count >= high) {
>> >         free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>> >                            pcp, pindex);
>> >     }
>> >
>> > And nr_pcp_free():
>> >
>> >     min_nr_free = batch;
>> >     max_nr_free = high - batch;
>> >
>> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
>> >     return batch;
>> >
>> > The 'batch' is not a fixed value but changes dynamically, doesn't it?
>>
>> Sorry, my words were confusing.  For 'batch', I mean the value of the
>> "count" parameter of free_pcppages_bulk() actually.  For example, if we
>> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.
>
> If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
> that the pcp->free_count should not exceed (63 << 0), right? (suppose
> 63 is the default batch size)
> However, at worst, the pcp->free_count can be (62 + (1 << MAX_ORDER)),
> is that expected?
>
> Perhaps we should make the change below?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e7313f9d704b..8c52a30201d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>         } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
>                 pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
>         }
> -       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
> +       if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
>                 pcp->free_count += (1 << order);
> +               if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
> +                       pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
> +       }
>         high = nr_pcp_high(pcp, zone, batch, free_high);
>         if (pcp->count >= high) {
>                 free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),

Or

	pcp->free_count = min(pcp->free_count + (1 << order),
			      batch << CONFIG_PCP_BATCH_SCALE_MAX);

--
Best Regards,
Huang, Ying
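To make the discussion above concrete, here is a small Python model of the two quantities involved: the bulk-free count clamped by the quoted nr_pcp_free() snippet, and the proposed cap on pcp->free_count (in its min() form). This is a sketch of the quoted logic, not the kernel code itself; the default of 5 for CONFIG_PCP_BATCH_SCALE_MAX is the upstream Kconfig default.

```python
def nr_pcp_free(free_count, batch, high):
    # The quoted nr_pcp_free() clamp: free between batch and
    # high - batch pages, driven by the dynamic free_count.
    return max(batch, min(free_count, high - batch))

def bump_free_count(free_count, order, batch, scale_max=5):
    # The min() form of the proposed fix: freeing a high-order page
    # adds (1 << order) at once, but free_count must never overshoot
    # batch << CONFIG_PCP_BATCH_SCALE_MAX.
    return min(free_count + (1 << order), batch << scale_max)

# With the default high of 6392 (batch 63), a warmed-up free_count can
# drive a single bulk free of up to high - batch pages under zone->lock:
print(nr_pcp_free(6392, 63, 6392))   # -> 6329
# With the minimum high of 4 * batch = 252, at most 189 pages per bulk free:
print(nr_pcp_free(6392, 63, 252))    # -> 189
# The cap: adding an order-10 page just under the limit (63 << 5 == 2016)
# no longer overshoots it:
print(bump_free_count(2015, 10, 63)) # -> 2016
```

This is the mechanism behind the latency improvement: shrinking high bounds how many pages one free_pcppages_bulk() call moves while holding zone->lock, and the cap bounds how far free_count (and hence the next bulk-free batch) can grow.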