From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 2 Jul 2024 20:07:57 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: "Huang, Ying"
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
In-Reply-To: <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240701142046.6050-1-laoar.shao@gmail.com>
 <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
 <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Tue, Jul 2, 2024 at 5:10 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
> >>
> >> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
> >>
> >> > Currently, we're encountering latency spikes in our container
> >> > environment when a specific container with multiple Python-based
> >> > tasks exits. These tasks may hold the zone->lock for an extended
> >> > period, significantly impacting latency for other containers
> >> > attempting to allocate memory.
> >>
> >> Is this locking issue well understood? Is anyone working on it? A
> >> reasonably detailed description of the issue and a description of any
> >> ongoing work would be helpful here.
> >
> > In our containerized environment, we have a specific type of container
> > that runs 18 processes, each consuming approximately 6GB of RSS. These
> > processes are organized as separate processes rather than threads
> > because the Python Global Interpreter Lock (GIL) is a bottleneck in a
> > multi-threaded setup. When these containers exit, other containers
> > hosted on the same machine experience significant latency spikes.
> >
> > Our investigation with perf tracing revealed that the root cause of
> > these spikes is the simultaneous execution of exit_mmap() by the
> > exiting processes. Their concurrent access to the zone->lock results
> > in contention, which becomes a hotspot and negatively impacts
> > performance. The perf results clearly indicate this contention as the
> > primary contributor to the observed latency issues.
> >
> >   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]   [k] mmput
> >   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]   [k] exit_mmap
> >      - 76.97% exit_mmap
> >         - 58.58% unmap_vmas
> >            - 58.55% unmap_single_vma
> >               - unmap_page_range
> >                  - 58.32% zap_pte_range
> >                     - 42.88% tlb_flush_mmu
> >                        - 42.76% free_pages_and_swap_cache
> >                           - 41.22% release_pages
> >                              - 33.29% free_unref_page_list
> >                                 - 32.37% free_unref_page_commit
> >                                    - 31.64% free_pcppages_bulk
> >                                       + 28.65% _raw_spin_lock
> >                                         1.28% __list_del_entry_valid
> >                              + 3.25% folio_lruvec_lock_irqsave
> >                              + 0.75% __mem_cgroup_uncharge_list
> >                                0.60% __mod_lruvec_state
> >                           1.07% free_swap_cache
> >                  + 11.69% page_remove_rmap
> >                    0.64% __mod_lruvec_page_state
> >         - 17.34% remove_vma
> >            - 17.25% vm_area_free
> >               - 17.23% kmem_cache_free
> >                  - 17.15% __slab_free
> >                     - 14.56% discard_slab
> >                          free_slab
> >                          __free_slab
> >                          __free_pages
> >                        - free_unref_page
> >                           - 13.50% free_unref_page_commit
> >                              - free_pcppages_bulk
> >                                 + 13.44% _raw_spin_lock
> >
> > By enabling the mm_page_pcpu_drain tracepoint, we can see the detailed
> > call stack:
> >
> >   <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >       page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >   <...>-1540432 [224] d..3. 618048.023887:
> >    => free_pcppages_bulk
> >    => free_unref_page_commit
> >    => free_unref_page_list
> >    => release_pages
> >    => free_pages_and_swap_cache
> >    => tlb_flush_mmu
> >    => zap_pte_range
> >    => unmap_page_range
> >    => unmap_single_vma
> >    => unmap_vmas
> >    => exit_mmap
> >    => mmput
> >    => do_exit
> >    => do_group_exit
> >    => get_signal
> >    => arch_do_signal_or_restart
> >    => exit_to_user_mode_prepare
> >    => syscall_exit_to_user_mode
> >    => do_syscall_64
> >    => entry_SYSCALL_64_after_hwframe
> >
> > The servers experiencing these issues have substantial hardware: 256
> > CPUs and 1TB of memory, all within a single NUMA node. The zoneinfo is
> > as follows:
> >
> >   Node 0, zone   Normal
> >     pages free     144465775
> >           boost    0
> >           min      1309270
> >           low      1636587
> >           high     1963904
> >           spanned  564133888
> >           present  296747008
> >           managed  291974346
> >           cma      0
> >           protection: (0, 0, 0, 0)
> >   ...
> >   pagesets
> >     cpu: 0
> >               count: 2217
> >               high:  6392
> >               batch: 63
> >       vm stats threshold: 125
> >     cpu: 1
> >               count: 4510
> >               high:  6392
> >               batch: 63
> >       vm stats threshold: 125
> >     cpu: 2
> >               count: 3059
> >               high:  6392
> >               batch: 63
> >   ...
> >
> > The high is around 100 times the batch size.
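
As an aside, that ratio falls out of how the kernel sizes pcp->high. The
following is a rough userspace sketch (my own approximation of the
zone_highsize() logic, not the kernel code itself, and the exact
calculation differs across kernel versions) that plugs in the low
watermark, managed pages, CPU count, and batch size from the zoneinfo
above:

/*
 * Approximation of pcp->high sizing: base it on the zone low watermark
 * (or on managed_pages / percpu_pagelist_high_fraction when the sysctl
 * is set), split it across the CPUs of the node, and clamp it to a
 * floor of 4 * batch.
 */
#include <stdio.h>

static long pcp_high(long low_wmark, long managed, long fraction,
		     int nr_cpus, int batch)
{
	/* fraction == 0 means the sysctl is unset */
	long total = fraction ? managed / fraction : low_wmark;
	long high = total / nr_cpus;

	return high > 4L * batch ? high : 4L * batch;
}

int main(void)
{
	long low_wmark = 1636587, managed = 291974346;	/* from the zoneinfo */
	int nr_cpus = 256, batch = 63;

	/* default sizing: 1636587 / 256 = 6392 pages, ~100x the batch of 63 */
	printf("default high: %ld\n",
	       pcp_high(low_wmark, managed, 0, nr_cpus, batch));
	/* a huge fraction (e.g. 0x7fffffff) collapses to the 4 * batch floor */
	printf("minimum high: %ld\n",
	       pcp_high(low_wmark, managed, 0x7fffffff, nr_cpus, batch));
	return 0;
}

With the default sizing this prints 6392, matching the "high: 6392"
shown above, while any sufficiently large fraction collapses to the
4 * batch floor of 252 pages.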
> >
> > We also traced the latency of the free_pcppages_bulk() function during
> > the container exit process:
> >
> > 19:48:54
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 0        |                                        |
> >        256 -> 511        : 148      |*****************                       |
> >        512 -> 1023       : 334      |****************************************|
> >       1024 -> 2047       : 33       |***                                     |
> >       2048 -> 4095       : 5        |                                        |
> >       4096 -> 8191       : 7        |                                        |
> >       8192 -> 16383      : 12       |*                                       |
> >      16384 -> 32767      : 30       |***                                     |
> >      32768 -> 65535      : 21       |**                                      |
> >      65536 -> 131071     : 15       |*                                       |
> >     131072 -> 262143     : 27       |***                                     |
> >     262144 -> 524287     : 84       |**********                              |
> >     524288 -> 1048575    : 203      |************************                |
> >    1048576 -> 2097151    : 284      |**********************************      |
> >    2097152 -> 4194303    : 327      |*************************************** |
> >    4194304 -> 8388607    : 215      |*************************               |
> >    8388608 -> 16777215   : 116      |*************                           |
> >   16777216 -> 33554431   : 47       |*****                                   |
> >   33554432 -> 67108863   : 8        |                                        |
> >   67108864 -> 134217727  : 3        |                                        |
> >
> > avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
> >
> > The latency can reach tens of milliseconds.
> >
> > By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> > minimum pagelist high at 4 times the batch size, we were able to
> > significantly reduce the latency of free_pcppages_bulk() during
> > container exits:
> >
> >      nsecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 0        |                                        |
> >        128 -> 255        : 120      |                                        |
> >        256 -> 511        : 365      |*                                       |
> >        512 -> 1023       : 201      |                                        |
> >       1024 -> 2047       : 103      |                                        |
> >       2048 -> 4095       : 84       |                                        |
> >       4096 -> 8191       : 87       |                                        |
> >       8192 -> 16383      : 4777     |**************                          |
> >      16384 -> 32767      : 10572    |*******************************         |
> >      32768 -> 65535      : 13544    |****************************************|
> >      65536 -> 131071     : 12723    |*************************************   |
> >     131072 -> 262143     : 8604     |*************************               |
> >     262144 -> 524287     : 3659     |**********                              |
> >     524288 -> 1048575    : 921      |**                                      |
> >    1048576 -> 2097151    : 122      |                                        |
> >    2097152 -> 4194303    : 5        |                                        |
> >
> > avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
> >
> > After tuning the vm.percpu_pagelist_high_fraction sysctl knob to set
> > the minimum pagelist high at a level that effectively mitigated the
> > latency issues, we observed that the other containers no longer
> > reported similar complaints. As a result, we adopted this tuning as a
> > permanent workaround and have deployed it across all clusters of
> > servers where these containers may be deployed.

> Thanks for your detailed data.
>
> IIUC, the latency of free_pcppages_bulk() during process exiting
> shouldn't be a problem?

Right. The problem arises when the exiting process holds the zone->lock
for too long, forcing other processes that are trying to allocate memory
to wait.

> Because users care more about the total time of process exiting, that
> is, throughput. And I suspect that the zone->lock contention and page
> allocating/freeing throughput will be worse with your configuration?
While the throughput reduction is not a significant concern (the
difference is minimal), latency spikes are a crucial metric for system
stability and matter far more to users: higher latency can lead to
request errors and a degraded user experience. Maintaining stable
latency, even at the cost of slightly lower throughput, is preferable to
higher throughput with unstable performance.

> But the latency of free_pcppages_bulk() and page allocation in other
> processes is a problem. And your configuration can help it.
>
> Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX. In that way,
> you have a normal PCP size (high) but smaller PCP batch. I guess that
> may help both latency and throughput in your system. Could you give it
> a try?

Our current kernel does not include the CONFIG_PCP_BATCH_SCALE_MAX
option. However, having seen your recent improvements to the zone->lock
path, particularly commit 52166607ecc9 ("mm: restrict the pcp batch
scale factor to avoid too long latency"), I experimented with manually
setting pcp->free_factor to zero. While this provided some improvement,
the results were not as significant as I had hoped.

BTW, perhaps we should consider introducing a sysctl knob as an
alternative to CONFIG_PCP_BATCH_SCALE_MAX? That would make it easier for
users to adjust.

Below is my reply to your question in the other thread:

> Could you measure the run time of free_pcppages_bulk(), this can be done
> via ftrace function_graph tracer. We want to check whether this is a
> common issue.

I don't believe this is a common issue; we have only observed the
latency spikes under this specific workload.

> If it is really necessary, can we just use a large enough number for
> vm.percpu_pagelist_high_fraction? For example, (1 << 30)?

We are currently setting the value to 0x7fffffff, which can be confusing
to others because of its arbitrary nature. Given that the minimum high
size is a special value, specifically 4 times the batch size, I believe
it would be better to introduce a dedicated sysctl value that clearly
represents this setting. That would make the configuration more
intuitive for users and provide a clear, documented reference going
forward.

--
Regards
Yafang
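
To put rough numbers on the batch-scaling discussion: below is a
simplified userspace model (a sketch of my own, not the kernel code) of
how many pages a single free_pcppages_bulk() call may free while holding
zone->lock. It assumes the bulk-free count is roughly
batch << free_factor, clamped so that at least one batch stays on the
pcp list, with CONFIG_PCP_BATCH_SCALE_MAX capping free_factor as in
commit 52166607ecc9; the exact kernel logic differs across versions.

#include <stdio.h>

static int pages_per_lock_hold(int batch, int high, int free_factor,
			       int scale_max)
{
	int max_nr_free = high - batch;	/* leave at least one batch on the pcp */
	int nr;

	if (free_factor > scale_max)	/* modelled on CONFIG_PCP_BATCH_SCALE_MAX */
		free_factor = scale_max;
	nr = batch << free_factor;
	if (nr > max_nr_free)
		nr = max_nr_free;
	if (nr < batch)
		nr = batch;
	return nr;
}

int main(void)
{
	int batch = 63;

	/* default high = 6392, free_factor fully ramped up, no scale cap */
	printf("high=6392, uncapped: %d pages per lock hold\n",
	       pages_per_lock_hold(batch, 6392, 10, 10));
	/* same high, but free_factor capped at 5 */
	printf("high=6392, cap=5:    %d pages per lock hold\n",
	       pages_per_lock_hold(batch, 6392, 10, 5));
	/* high forced to the 4 * batch minimum of 252 */
	printf("high=252 (4*batch):  %d pages per lock hold\n",
	       pages_per_lock_hold(batch, 252, 10, 10));
	return 0;
}

Under these assumptions the default high of 6392 allows roughly 6300
pages to be freed per lock hold, a batch scale cap of 5 brings that down
to about 2000, and the 4 * batch minimum brings it below 200, which
lines up with the shift between the two latency histograms above.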