From: Yafang Shao <laoar.shao@gmail.com>
Date: Wed, 3 Jul 2024 11:44:08 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: "Huang, Ying"
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
In-Reply-To: <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240701142046.6050-1-laoar.shao@gmail.com>
 <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
 <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmm3kzmy.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plrvkvo8.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Jul 3, 2024 at 11:23 AM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Wed, Jul 3, 2024 at 9:57 AM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
> >> >> >>
> >> >> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
> >> >> >>
> >> >> >> > Currently, we're encountering latency spikes in our container environment
> >> >> >> > when a specific container with multiple Python-based tasks exits. These
> >> >> >> > tasks may hold the zone->lock for an extended period, significantly
> >> >> >> > impacting latency for other containers attempting to allocate memory.
> >> >> >>
> >> >> >> Is this locking issue well understood?  Is anyone working on it?  A
> >> >> >> reasonably detailed description of the issue and a description of any
> >> >> >> ongoing work would be helpful here.
> >> >> >
> >> >> > In our containerized environment, we have a specific type of container
> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> >> >> > processes are organized as separate processes rather than threads due
> >> >> > to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> > containers hosted on the same machine experience significant latency
> >> >> > spikes.
> >> >> >
> >> >> > Our investigation using perf tracing revealed that the root cause of
> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> > primary contributor to the observed latency issues.
> >> >> >
> >> >> >   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >      - 76.97% exit_mmap
> >> >> >         - 58.58% unmap_vmas
> >> >> >            - 58.55% unmap_single_vma
> >> >> >               - unmap_page_range
> >> >> >                  - 58.32% zap_pte_range
> >> >> >                     - 42.88% tlb_flush_mmu
> >> >> >                        - 42.76% free_pages_and_swap_cache
> >> >> >                           - 41.22% release_pages
> >> >> >                              - 33.29% free_unref_page_list
> >> >> >                                 - 32.37% free_unref_page_commit
> >> >> >                                    - 31.64% free_pcppages_bulk
> >> >> >                                       + 28.65% _raw_spin_lock
> >> >> >                                         1.28% __list_del_entry_valid
> >> >> >                              + 3.25% folio_lruvec_lock_irqsave
> >> >> >                              + 0.75% __mem_cgroup_uncharge_list
> >> >> >                                0.60% __mod_lruvec_state
> >> >> >                           1.07% free_swap_cache
> >> >> >                     + 11.69% page_remove_rmap
> >> >> >                       0.64% __mod_lruvec_page_state
> >> >> >         - 17.34% remove_vma
> >> >> >            - 17.25% vm_area_free
> >> >> >               - 17.23% kmem_cache_free
> >> >> >                  - 17.15% __slab_free
> >> >> >                     - 14.56% discard_slab
> >> >> >                          free_slab
> >> >> >                          __free_slab
> >> >> >                          __free_pages
> >> >> >                        - free_unref_page
> >> >> >                           - 13.50% free_unref_page_commit
> >> >> >                              - free_pcppages_bulk
> >> >> >                                 + 13.44% _raw_spin_lock
> >> >> >
> >> >> > By enabling the mm_page_pcpu_drain tracepoint, we can find the detailed stack:
> >> >> >
> >> >> >      <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> > page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >      <...>-1540432 [224] d..3. 618048.023887:
> >> >> >  => free_pcppages_bulk
> >> >> >  => free_unref_page_commit
> >> >> >  => free_unref_page_list
> >> >> >  => release_pages
> >> >> >  => free_pages_and_swap_cache
> >> >> >  => tlb_flush_mmu
> >> >> >  => zap_pte_range
> >> >> >  => unmap_page_range
> >> >> >  => unmap_single_vma
> >> >> >  => unmap_vmas
> >> >> >  => exit_mmap
> >> >> >  => mmput
> >> >> >  => do_exit
> >> >> >  => do_group_exit
> >> >> >  => get_signal
> >> >> >  => arch_do_signal_or_restart
> >> >> >  => exit_to_user_mode_prepare
> >> >> >  => syscall_exit_to_user_mode
> >> >> >  => do_syscall_64
> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >
> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> > within a single NUMA node. The zoneinfo is as follows,
> >> >> >
> >> >> > Node 0, zone   Normal
> >> >> >   pages free     144465775
> >> >> >         boost    0
> >> >> >         min      1309270
> >> >> >         low      1636587
> >> >> >         high     1963904
> >> >> >         spanned  564133888
> >> >> >         present  296747008
> >> >> >         managed  291974346
> >> >> >         cma      0
> >> >> >         protection: (0, 0, 0, 0)
> >> >> > ...
> >> >> > ...
> >> >> > pagesets
> >> >> >     cpu: 0
> >> >> >               count: 2217
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 1
> >> >> >               count: 4510
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >   vm stats threshold: 125
> >> >> >     cpu: 2
> >> >> >               count: 3059
> >> >> >               high:  6392
> >> >> >               batch: 63
> >> >> >
> >> >> > ...
> >> >> >
> >> >> > The high is around 100 times the batch size.
> >> >> >
> >> >> > We also traced the latency associated with the free_pcppages_bulk()
> >> >> > function during the container exit process:
> >> >> >
> >> >> > 19:48:54
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >        256 -> 511        : 148      |*****************                       |
> >> >> >        512 -> 1023       : 334      |****************************************|
> >> >> >       1024 -> 2047       : 33       |***                                     |
> >> >> >       2048 -> 4095       : 5        |                                        |
> >> >> >       4096 -> 8191       : 7        |                                        |
> >> >> >       8192 -> 16383      : 12       |*                                       |
> >> >> >      16384 -> 32767      : 30       |***                                     |
> >> >> >      32768 -> 65535      : 21       |**                                      |
> >> >> >      65536 -> 131071     : 15       |*                                       |
> >> >> >     131072 -> 262143     : 27       |***                                     |
> >> >> >     262144 -> 524287     : 84       |**********                              |
> >> >> >     524288 -> 1048575    : 203      |************************                |
> >> >> >    1048576 -> 2097151    : 284      |**********************************      |
> >> >> >    2097152 -> 4194303    : 327      |*************************************** |
> >> >> >    4194304 -> 8388607    : 215      |*************************               |
> >> >> >    8388608 -> 16777215   : 116      |*************                           |
> >> >> >   16777216 -> 33554431   : 47       |*****                                   |
> >> >> >   33554432 -> 67108863   : 8        |                                        |
> >> >> >   67108864 -> 134217727  : 3        |                                        |
> >> >> >
> >> >> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >> >> >
> >> >> > The latency can reach tens of milliseconds.
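
To make the "high: 6392" above easier to interpret, here is a rough,
userspace-only sketch of the arithmetic. The helper name and the formula are
only an approximation of the pcp sizing logic in mm/page_alloc.c (by default
the zone low watermark is split across the CPUs local to the zone), with the
4*batch floor taken from the minimum referred to below; this is not the
kernel code itself:

    #include <stdio.h>

    /* Approximate the per-CPU pagelist "high" limit for one zone. */
    static long pcp_high_estimate(long low_wmark_pages, long managed_pages,
                                  long nr_cpus, long batch, long high_fraction)
    {
            long total, high;

            if (high_fraction == 0)
                    total = low_wmark_pages;                /* default: based on the low watermark */
            else
                    total = managed_pages / high_fraction;  /* sysctl: a fraction of managed pages */

            high = total / nr_cpus;                         /* split across the zone's CPUs */
            if (high < 4 * batch)                           /* floor: 4 times the batch size */
                    high = 4 * batch;
            return high;
    }

    int main(void)
    {
            /* Numbers taken from the zoneinfo and pagesets shown above. */
            long low = 1636587, managed = 291974346, cpus = 256, batch = 63;

            printf("default high ~= %ld\n", pcp_high_estimate(low, managed, cpus, batch, 0));
            /* A large fraction pushes the estimate below the floor, so high becomes 4*batch. */
            printf("tuned high   ~= %ld\n", pcp_high_estimate(low, managed, cpus, batch, 100000));
            return 0;
    }

With these inputs the default estimate comes out at 6392 pages per CPU, which
matches the pagesets dump above, while a sufficiently large
vm.percpu_pagelist_high_fraction drives it down to the 4*batch (252 page)
minimum discussed below.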
> >> >> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> >> >> > minimum pagelist high at 4 times the batch size, we were able to
> >> >> > significantly reduce the latency associated with the
> >> >> > free_pcppages_bulk() function during container exits:
> >> >> >
> >> >> >      nsecs               : count     distribution
> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >        128 -> 255        : 120      |                                        |
> >> >> >        256 -> 511        : 365      |*                                       |
> >> >> >        512 -> 1023       : 201      |                                        |
> >> >> >       1024 -> 2047       : 103      |                                        |
> >> >> >       2048 -> 4095       : 84       |                                        |
> >> >> >       4096 -> 8191       : 87       |                                        |
> >> >> >       8192 -> 16383      : 4777     |**************                          |
> >> >> >      16384 -> 32767      : 10572    |*******************************         |
> >> >> >      32768 -> 65535      : 13544    |****************************************|
> >> >> >      65536 -> 131071     : 12723    |*************************************   |
> >> >> >     131072 -> 262143     : 8604     |*************************               |
> >> >> >     262144 -> 524287     : 3659     |**********                              |
> >> >> >     524288 -> 1048575    : 921      |**                                      |
> >> >> >    1048576 -> 2097151    : 122      |                                        |
> >> >> >    2097152 -> 4194303    : 5        |                                        |
> >> >> >
> >> >> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >> >> >
> >> >> > After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> >> >> > knob to set the minimum pagelist high at a level that effectively
> >> >> > mitigated latency issues, we observed that other containers were no
> >> >> > longer experiencing similar complaints. As a result, we decided to
> >> >> > implement this tuning as a permanent workaround and have deployed it
> >> >> > across all clusters of servers where these containers may be deployed.
> >> >>
> >> >> Thanks for your detailed data.
> >> >>
> >> >> IIUC, the latency of free_pcppages_bulk() during process exiting
> >> >> shouldn't be a problem?
> >> >
> >> > Right. The problem arises when the process holds the lock for too
> >> > long, causing other processes that are attempting to allocate memory
> >> > to experience delays or wait times.
> >> >
> >> >> Because users care more about the total time of
> >> >> process exiting, that is, throughput.  And I suspect that the zone->lock
> >> >> contention and page allocating/freeing throughput will be worse with
> >> >> your configuration?
> >> >
> >> > While reducing throughput may not be a significant concern due to the
> >> > minimal difference, the potential for latency spikes, a crucial metric
> >> > for assessing system stability, is of greater concern to users. Higher
> >> > latency can lead to request errors, impacting the user experience.
> >> > Therefore, maintaining stability, even at the cost of slightly lower
> >> > throughput, is preferable to experiencing higher throughput with
> >> > unstable performance.
> >> >
> >> >>
> >> >> But the latency of free_pcppages_bulk() and page allocation in other
> >> >> processes is a problem.  And your configuration can help it.
> >> >>
> >> >> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  In that way,
> >> >> you have a normal PCP size (high) but smaller PCP batch.  I guess that
> >> >> may help both latency and throughput in your system.  Could you give it
> >> >> a try?
> >> >
> >> > Currently, our kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
> >> > configuration option.
> >> > However, I've observed your recent improvements
> >> > to the zone->lock mechanism, particularly commit 52166607ecc9 ("mm:
> >> > restrict the pcp batch scale factor to avoid too long latency"), which
> >> > has prompted me to experiment with manually setting the
> >> > pcp->free_factor to zero. While this adjustment provided some
> >> > improvement, the results were not as significant as I had hoped.
> >> >
> >> > BTW, perhaps we should consider the implementation of a sysctl knob as
> >> > an alternative to CONFIG_PCP_BATCH_SCALE_MAX? This would allow users
> >> > to more easily adjust it.
> >>
> >> If you cannot test upstream behavior, it's hard to make changes to
> >> upstream.  Could you find a way to do that?
> >
> > I'm afraid I can't run an upstream kernel in our production environment :(
> > Lots of code changes have to be made.
>
> Understand.  Can you find a way to test upstream behavior, not upstream
> kernel exactly?  Or test the upstream kernel but in a similar but not
> exactly production environment.

I'm willing to give it a try, but it may take some time to achieve the
desired results.

>
> >> IIUC, PCP high will not influence allocate/free latency, PCP batch will.
> >
> > It seems incorrect.
> > Look at the code in free_unref_page_commit():
> >
> >     if (pcp->count >= high) {
> >             free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> >                                pcp, pindex);
> >     }
> >
> > And nr_pcp_free():
> >
> >     min_nr_free = batch;
> >     max_nr_free = high - batch;
> >
> >     batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);
> >     return batch;
> >
> > The 'batch' is not a fixed value but changes dynamically, doesn't it?
>
> Sorry, my words were confusing.  For 'batch', I mean the value of the
> "count" parameter of free_pcppages_bulk() actually.  For example, if we
> change CONFIG_PCP_BATCH_SCALE_MAX, we restrict that.

If we set CONFIG_PCP_BATCH_SCALE_MAX to 0, what we actually expect is
that pcp->free_count should not exceed (63 << 0), right? (suppose 63 is
the default batch size)

However, at worst, pcp->free_count can reach (62 + (1 << MAX_ORDER)); is
that expected?

Perhaps we should make the change below?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e7313f9d704b..8c52a30201d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2533,8 +2533,11 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) {
 		pcp->free_count += (1 << order);
+		if (unlikely(pcp->free_count > (batch << CONFIG_PCP_BATCH_SCALE_MAX)))
+			pcp->free_count = batch << CONFIG_PCP_BATCH_SCALE_MAX;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),

--
Regards
Yafang
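
To spell out the overshoot arithmetic from the question above, here is a tiny
userspace illustration. The constants are assumptions for the example (batch
size 63, CONFIG_PCP_BATCH_SCALE_MAX of 0, and order 10 as the largest order
being freed); it only models the guard-then-add pattern shown in the diff, not
the kernel itself:

    #include <stdio.h>

    #define BATCH           63      /* assumed default pcp->batch */
    #define SCALE_MAX       0       /* assumed CONFIG_PCP_BATCH_SCALE_MAX */
    #define MAX_PAGE_ORDER  10      /* assumed largest order being freed */

    int main(void)
    {
            int cap = BATCH << SCALE_MAX;   /* intended ceiling: 63 */
            int free_count = cap - 1;       /* just under the guard, so the add still runs */

            /* Mirrors the guard-then-add in free_unref_page_commit() above. */
            if (free_count < cap)
                    free_count += 1 << MAX_PAGE_ORDER;

            printf("intended cap: %d, worst-case free_count: %d\n", cap, free_count);
            return 0;
    }

This prints an intended cap of 63 but a worst-case free_count of 1086, i.e.
62 + (1 << 10), which is the gap the proposed clamp is meant to close.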