References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org> <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com> <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com> <87jzi3kphw.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87plrshbkd.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Yafang Shao
Date: Fri, 5 Jul 2024 11:03:37 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: "Huang, Ying"
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman

On Fri, Jul 5, 2024 at 9:30 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Wed, Jul 3, 2024 at 1:36 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yafang Shao writes:
> >> >> >> >>
> >> >> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
> >> >> >> >> >>
> >> >> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >> >> >>
> >> >> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> >> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> >> >> ongoing work would be helpful here.
> >> >> >> >> >
> >> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> >> >> > processes are organized as separate processes rather than threads due
> >> >> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> >> > spikes.
> >> >> >> >> >
> >> >> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >> >
> >> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >> >             - unmap_page_range
> >> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >> >                         - 41.22% release_pages
> >> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >> >                           1.07% free_swap_cache
> >> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >> >       - 17.34% remove_vma
> >> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >> >                - 17.15% __slab_free
> >> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >> >                        free_slab
> >> >> >> >> >                        __free_slab
> >> >> >> >> >                        __free_pages
> >> >> >> >> >                      - free_unref_page
> >> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >> >> >
> >> >> >> >> > By enabling the mm_page_pcpu_drain() tracepoint we can find the detailed stack:
> >> >> >> >> >
> >> >> >> >> >     <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain: page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> >> >     <...>-1540432 [224] d..3. 618048.023887:
> >> >> >> >> >  => free_pcppages_bulk
> >> >> >> >> >  => free_unref_page_commit
> >> >> >> >> >  => free_unref_page_list
> >> >> >> >> >  => release_pages
> >> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >> >  => tlb_flush_mmu
> >> >> >> >> >  => zap_pte_range
> >> >> >> >> >  => unmap_page_range
> >> >> >> >> >  => unmap_single_vma
> >> >> >> >> >  => unmap_vmas
> >> >> >> >> >  => exit_mmap
> >> >> >> >> >  => mmput
> >> >> >> >> >  => do_exit
> >> >> >> >> >  => do_group_exit
> >> >> >> >> >  => get_signal
> >> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >> >  => do_syscall_64
> >> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >> >
> >> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >> >> >
> >> >> >> >> > Node 0, zone   Normal
> >> >> >> >> >   pages free     144465775
> >> >> >> >> >         boost    0
> >> >> >> >> >         min      1309270
> >> >> >> >> >         low      1636587
> >> >> >> >> >         high     1963904
> >> >> >> >> >         spanned  564133888
> >> >> >> >> >         present  296747008
> >> >> >> >> >         managed  291974346
> >> >> >> >> >         cma      0
> >> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> >> > ...
> >> >> >> >> > ...
> >> >> >> >> >   pagesets
> >> >> >> >> >     cpu: 0
> >> >> >> >> >               count: 2217
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 1
> >> >> >> >> >               count: 4510
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >   vm stats threshold: 125
> >> >> >> >> >     cpu: 2
> >> >> >> >> >               count: 3059
> >> >> >> >> >               high:  6392
> >> >> >> >> >               batch: 63
> >> >> >> >> >
> >> >> >> >> > ...
> >> >> >> >> >
> >> >> >> >> > The high is around 100 times the batch size.
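
As background, that roughly 100x ratio between high and batch falls out of how the kernel sizes the per-CPU pagelists. Below is a simplified sketch of the sizing logic, loosely following zone_highsize() in mm/page_alloc.c; the helper name, signature and details are illustrative only and differ from the real code across kernel versions:

/*
 * Illustrative sketch only (not the kernel's function): roughly how
 * pcp->high is derived for a zone.  With the default
 * percpu_pagelist_high_fraction == 0 it is based on the zone's low
 * watermark; with a non-zero fraction it is based on managed pages.
 */
static int zone_highsize_sketch(unsigned long managed_pages,
                                unsigned long low_wmark_pages,
                                int batch, int nr_local_cpus,
                                int high_fraction)
{
        unsigned long total_pages;
        int high;

        if (!high_fraction)
                total_pages = low_wmark_pages;              /* default policy */
        else
                total_pages = managed_pages / high_fraction;

        /* Split the budget across the CPUs local to the zone's node. */
        high = total_pages / nr_local_cpus;

        /* Never drop below four batches, the historical minimum. */
        return high < (batch << 2) ? (batch << 2) : high;
}

Plugging in the numbers above (low watermark 1636587 pages, 256 local CPUs) gives roughly 6392 pages per CPU, i.e. about 100 times the batch of 63, matching the pagesets dump; pushing the sysctl to its minimum-pagelist setting instead clamps high to 4 * batch = 252 pages.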
> >> >> >> >> >
> >> >> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> >> >> > function during the container exit process:
> >> >> >> >> >
> >> >> >> >> > 19:48:54
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >> >> >
> >> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >> >
> >> >> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> >> >> >> >> > minimum pagelist high at 4 times the batch size, we were able to
> >> >> >> >> > significantly reduce the latency associated with the
> >> >> >> >> > free_pcppages_bulk() function during container exits:
> >> >> >> >> >
> >> >> >> >> >      nsecs               : count     distribution
> >> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >> >> >
> >> >> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >> >> >> >> >
> >> >> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> >> >> >> >> > knob to set the minimum pagelist high at a level that effectively
> >> >> >> >> > mitigated latency issues, we observed that other containers were no
> >> >> >> >> > longer experiencing similar complaints. As a result, we decided to
> >> >> >> >> > implement this tuning as a permanent workaround and have deployed it
> >> >> >> >> > across all clusters of servers where these containers may be deployed.
> >> >> >> >>
> >> >> >> >> Thanks for your detailed data.
> >> >> >> >>
> >> >> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
> >> >> >> >> shouldn't be a problem?
> >> >> >> >
> >> >> >> > Right. The problem arises when the process holds the lock for too
> >> >> >> > long, causing other processes that are attempting to allocate memory
> >> >> >> > to experience delays or wait times.
> >> >> >> >
> >> >> >> >> Because users care more about the total time of
> >> >> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
> >> >> >> >> contention and page allocating/freeing throughput will be worse with
> >> >> >> >> your configuration?
> >> >> >> >
> >> >> >> > While reducing throughput may not be a significant concern due to the
> >> >> >> > minimal difference, the potential for latency spikes, a crucial metric
> >> >> >> > for assessing system stability, is of greater concern to users. Higher
> >> >> >> > latency can lead to request errors, impacting the user experience.
> >> >> >> > Therefore, maintaining stability, even at the cost of slightly lower
> >> >> >> > throughput, is preferable to experiencing higher throughput with
> >> >> >> > unstable performance.
> >> >> >> >
> >> >> >> >>
> >> >> >> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> >> >> >> processes is a problem.  And your configuration can help it.
> >> >> >> >>
> >> >> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> >> >> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> >> >> >> >> may help both latency and throughput in your system.  Could you give it
> >> >> >> >> a try?
> >> >> >> >
> >> >> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> >> >> >> > configuration option. However, I've observed your recent improvements
> >> >> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> >> >> >> > restrict the pcp batch scale factor to avoid too long latency"), which
> >> >> >> > has prompted me to experiment with manually setting the
> >> >> >> > pcp->free_factor to zero. While this adjustment provided some
> >> >> >> > improvement, the results were not as significant as I had hoped.
> >> >> >> >
> >> >> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
> >> >> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> >> >> >> > to more easily adjust it.
> >> >> >>
> >> >> >> If you cannot test upstream behavior, it's hard to make changes to
> >> >> >> upstream.  Could you find a way to do that?
> >> >> >
> >> >> > I'm afraid I can't run an upstream kernel in our production environment :(
> >> >> > Lots of code changes have to be made.
> >> >>
> >> >> Understand.  Can you find a way to test upstream behavior, not upstream
> >> >> kernel exactly?  Or test the upstream kernel but in a similar but not
> >> >> exactly production environment.
> >> >
> >> > I'm willing to give it a try, but it may take some time to achieve the
> >> > desired results.
> >>
> >> Thanks!
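
To make the CONFIG_PCP_BATCH_SCALE_MAX idea above concrete: the point is to cap how many pages one CPU frees back to the buddy allocator per lock hold while draining its PCP list. The sketch below only illustrates that shape; the function name and calling context are assumptions for the example, not the upstream implementation:

/*
 * Illustrative sketch only: drain a PCP list in bounded slices so a
 * single CPU cannot hold zone->lock for a very long time.  The slice
 * size is batch << scale_max (scale_max being CONFIG_PCP_BATCH_SCALE_MAX
 * today, default 5, or a vm.pcp_batch_scale_max sysctl if one is added).
 */
static void drain_pcp_in_slices(struct zone *zone, struct per_cpu_pages *pcp,
                                int to_free, int batch, int scale_max)
{
        int slice = batch << scale_max;

        while (to_free > 0) {
                int nr = min(to_free, slice);

                spin_lock(&pcp->lock);
                /* free_pcppages_bulk() takes zone->lock internally. */
                free_pcppages_bulk(zone, nr, pcp, 0);
                spin_unlock(&pcp->lock);

                to_free -= nr;
                if (to_free > 0)
                        cond_resched();
        }
}

With batch = 63, scale_max = 5 allows up to 2016 pages per slice, while scale_max = 0 limits each slice to 63 pages, which is why lowering it shortens the worst-case zone->lock hold time at the cost of more lock acquisitions.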
> >
> > After I backported the series "mm: PCP high auto-tuning," which
> > consists of a total of 9 patches, to our 6.1.y stable kernel and
> > deployed it to our production environment, I observed a significant
> > reduction in latency. The results are as follows:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 11       |                                        |
> >       4096 -> 8191       : 3        |                                        |
> >       8192 -> 16383      : 1        |                                        |
> >      16384 -> 32767      : 2        |                                        |
> >      32768 -> 65535      : 7        |                                        |
> >      65536 -> 131071     : 198      |*********                               |
> >     131072 -> 262143     : 530      |************************                |
> >     262144 -> 524287     : 824      |**************************************  |
> >     524288 -> 1048575    : 852      |****************************************|
> >    1048576 -> 2097151    : 714      |*********************************       |
> >    2097152 -> 4194303    : 389      |******************                      |
> >    4194304 -> 8388607    : 143      |******                                  |
> >    8388608 -> 16777215   : 29       |*                                       |
> >   16777216 -> 33554431   : 1        |                                        |
> >
> > avg = 1181478 nsecs, total: 4380921824 nsecs, count: 3708
> >
> > Compared to the previous data, the maximum latency has been reduced to
> > less than 30ms.
>
> That series can reduce the allocation/freeing from/to the buddy system,
> thus reducing the lock contention.
>
> > Additionally, I introduced a new sysctl knob, vm.pcp_batch_scale_max,
> > to replace CONFIG_PCP_BATCH_SCALE_MAX. By tuning
> > vm.pcp_batch_scale_max from the default value of 5 to 0, the maximum
> > latency was further reduced to less than 2ms.
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 0        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 36       |                                        |
> >       2048 -> 4095       : 5063     |*****                                   |
> >       4096 -> 8191       : 31226    |********************************        |
> >       8192 -> 16383      : 37606    |*************************************** |
> >      16384 -> 32767      : 38359    |****************************************|
> >      32768 -> 65535      : 30652    |*******************************         |
> >      65536 -> 131071     : 18714    |*******************                     |
> >     131072 -> 262143     : 7968     |********                                |
> >     262144 -> 524287     : 1996     |**                                      |
> >     524288 -> 1048575    : 302      |                                        |
> >    1048576 -> 2097151    : 19       |                                        |
> >
> > avg = 40702 nsecs, total: 7002105331 nsecs, count: 172031
> >
> > After multiple trials, I observed no significant differences between
> > each attempt.
>
> The test results look good.
>
> > Therefore, we decided to backport your improvements to our local
> > kernel. Additionally, I propose introducing a new sysctl knob,
> > vm.pcp_batch_scale_max, to the upstream kernel. This will enable users
> > to easily tune the setting based on their specific workloads.
>
> The downside is that the pcp->high decaying (in decay_pcp_high()) will
> be slower.  That is, it will take longer for idle pages to be freed from
> PCP to buddy.  One possible solution is to keep the decaying page
> number, but use a loop as follows to control latency.
>
> while (count < decay_number) {
>         spin_lock();
>         free_pcppages_bulk(, batch, );
>         spin_unlock();
>         count -= batch;
>         if (count)
>                 cond_resched();
> }

I will try it with this additional change. Thanks for your suggestion.

IIUC, the additional change should be as follows?
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2248,7 +2248,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
 {
        int high_min, to_drain, batch;
-       int todo = 0;
+       int todo = 0, count = 0;
 
        high_min = READ_ONCE(pcp->high_min);
        batch = READ_ONCE(pcp->batch);
@@ -2258,20 +2258,21 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
         * control latency. This caps pcp->high decrement too.
         */
        if (pcp->high > high_min) {
-               pcp->high = max3(pcp->count - (batch << pcp_batch_scale_max),
+               pcp->high = max3(pcp->count - (batch << 5),
                                 pcp->high - (pcp->high >> 3), high_min);
                if (pcp->high > high_min)
                        todo++;
        }
 
        to_drain = pcp->count - pcp->high;
-       if (to_drain > 0) {
+       while (count < to_drain) {
                spin_lock(&pcp->lock);
-               free_pcppages_bulk(zone, to_drain, pcp, 0);
+               free_pcppages_bulk(zone, batch, pcp, 0);
                spin_unlock(&pcp->lock);
+               count += batch;
                todo++;
+               cond_resched();
        }

-- 
Regards
Yafang
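
If the proposed vm.pcp_batch_scale_max knob moves forward, one plausible wiring is an ordinary "vm" ctl_table entry in place of the Kconfig constant. The entry below is only a sketch under that assumption; the variable name, bounds, and registration point are not taken from any posted patch:

/* Illustrative sketch of a vm.pcp_batch_scale_max sysctl; not upstream code. */
static int pcp_batch_scale_max = 5;       /* mirrors the current Kconfig default */
static int pcp_batch_scale_max_max = 6;   /* assumed upper bound */

static struct ctl_table pcp_batch_scale_sysctl[] = {
        {
                .procname       = "pcp_batch_scale_max",
                .data           = &pcp_batch_scale_max,
                .maxlen         = sizeof(pcp_batch_scale_max),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = &pcp_batch_scale_max_max,
        },
        { }
};

/* Registered once at init time, e.g. register_sysctl("vm", pcp_batch_scale_sysctl); */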