From: Yafang Shao <laoar.shao@gmail.com>
Date: Thu, 11 Jul 2024 20:40:41 +0800
Subject: Re: [PATCH 0/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
To: "Huang, Ying"
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net, linux-mm@kvack.org
In-Reply-To: <8734og9on4.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240707094956.94654-1-laoar.shao@gmail.com>
 <874j8yar3z.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sewga0wx.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87bk349vg4.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <8734og9on4.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Thu, Jul 11, 2024 at 7:05 PM Huang, Ying wrote:
>
> Yafang Shao writes:
>
> > On Thu, Jul 11, 2024 at 4:38 PM Huang, Ying wrote:
> >>
> >> Yafang Shao writes:
> >>
> >> > On Thu, Jul 11, 2024 at 2:40 PM Huang, Ying wrote:
> >> >>
> >> >> Yafang Shao writes:
> >> >>
> >> >> > On Wed, Jul 10, 2024 at 11:02 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yafang Shao writes:
> >> >> >>
> >> >> >> > Background
> >> >> >> > ==========
> >> >> >> >
> >> >> >> > In our containerized environment, we have a specific type of container
> >> >> >> > that runs 18 processes, each consuming approximately 6GB of RSS. The
> >> >> >> > workload is organized as separate processes rather than threads because
> >> >> >> > the Python Global Interpreter Lock (GIL) is a bottleneck in a
> >> >> >> > multi-threaded setup. Upon the exit of these containers, other
> >> >> >> > containers hosted on the same machine experience significant latency
> >> >> >> > spikes.
> >> >> >> >
> >> >> >> > Investigation
> >> >> >> > =============
> >> >> >> >
> >> >> >> > My investigation using perf tracing revealed that the root cause of
> >> >> >> > these spikes is the simultaneous execution of exit_mmap() by each of
> >> >> >> > the exiting processes. This concurrent access to the zone->lock
> >> >> >> > results in contention, which becomes a hotspot and negatively impacts
> >> >> >> > performance. The perf results clearly indicate this contention as a
> >> >> >> > primary contributor to the observed latency issues.
> >> >> >> >
> >> >> >> > +   77.02%     0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
> >> >> >> > -   76.98%     0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
> >> >> >> >    - 76.97% exit_mmap
> >> >> >> >       - 58.58% unmap_vmas
> >> >> >> >          - 58.55% unmap_single_vma
> >> >> >> >             - unmap_page_range
> >> >> >> >                - 58.32% zap_pte_range
> >> >> >> >                   - 42.88% tlb_flush_mmu
> >> >> >> >                      - 42.76% free_pages_and_swap_cache
> >> >> >> >                         - 41.22% release_pages
> >> >> >> >                            - 33.29% free_unref_page_list
> >> >> >> >                               - 32.37% free_unref_page_commit
> >> >> >> >                                  - 31.64% free_pcppages_bulk
> >> >> >> >                                     + 28.65% _raw_spin_lock
> >> >> >> >                                       1.28% __list_del_entry_valid
> >> >> >> >                            + 3.25% folio_lruvec_lock_irqsave
> >> >> >> >                            + 0.75% __mem_cgroup_uncharge_list
> >> >> >> >                              0.60% __mod_lruvec_state
> >> >> >> >                        1.07% free_swap_cache
> >> >> >> >                   + 11.69% page_remove_rmap
> >> >> >> >                     0.64% __mod_lruvec_page_state
> >> >> >> >       - 17.34% remove_vma
> >> >> >> >          - 17.25% vm_area_free
> >> >> >> >             - 17.23% kmem_cache_free
> >> >> >> >                - 17.15% __slab_free
> >> >> >> >                   - 14.56% discard_slab
> >> >> >> >                        free_slab
> >> >> >> >                        __free_slab
> >> >> >> >                        __free_pages
> >> >> >> >                      - free_unref_page
> >> >> >> >                         - 13.50% free_unref_page_commit
> >> >> >> >                            - free_pcppages_bulk
> >> >> >> >                               + 13.44% _raw_spin_lock
> >> >> >>
> >> >> >> I don't think your change will reduce zone->lock contention cycles. So,
> >> >> >> I don't find the value of the above data.
> >> >> >>
> >> >> >> > By enabling the mm_page_pcpu_drain tracepoint we can locate the pertinent
> >> >> >> > pages, with the majority of them being regular order-0 user pages.
> >> >> >> >
> >> >> >> >  <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
> >> >> >> >          page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> >> >> >> >  <...>-1540432 [224] d..3. 618048.023887:
> >> >> >> >  => free_pcppages_bulk
> >> >> >> >  => free_unref_page_commit
> >> >> >> >  => free_unref_page_list
> >> >> >> >  => release_pages
> >> >> >> >  => free_pages_and_swap_cache
> >> >> >> >  => tlb_flush_mmu
> >> >> >> >  => zap_pte_range
> >> >> >> >  => unmap_page_range
> >> >> >> >  => unmap_single_vma
> >> >> >> >  => unmap_vmas
> >> >> >> >  => exit_mmap
> >> >> >> >  => mmput
> >> >> >> >  => do_exit
> >> >> >> >  => do_group_exit
> >> >> >> >  => get_signal
> >> >> >> >  => arch_do_signal_or_restart
> >> >> >> >  => exit_to_user_mode_prepare
> >> >> >> >  => syscall_exit_to_user_mode
> >> >> >> >  => do_syscall_64
> >> >> >> >  => entry_SYSCALL_64_after_hwframe
> >> >> >> >
> >> >> >> > The servers experiencing these issues are equipped with impressive
> >> >> >> > hardware specifications, including 256 CPUs and 1TB of memory, all
> >> >> >> > within a single NUMA node. The zoneinfo is as follows:
> >> >> >> >
> >> >> >> > Node 0, zone   Normal
> >> >> >> >   pages free     144465775
> >> >> >> >         boost    0
> >> >> >> >         min      1309270
> >> >> >> >         low      1636587
> >> >> >> >         high     1963904
> >> >> >> >         spanned  564133888
> >> >> >> >         present  296747008
> >> >> >> >         managed  291974346
> >> >> >> >         cma      0
> >> >> >> >         protection: (0, 0, 0, 0)
> >> >> >> > ...
> >> >> >> >   pagesets
> >> >> >> >     cpu: 0
> >> >> >> >               count: 2217
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 1
> >> >> >> >               count: 4510
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >   vm stats threshold: 125
> >> >> >> >     cpu: 2
> >> >> >> >               count: 3059
> >> >> >> >               high:  6392
> >> >> >> >               batch: 63
> >> >> >> >
> >> >> >> > ...
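[Aside: the "high: 6392" vs. "batch: 63" above falls out of how the per-CPU
high limit is derived. The toy calculation below is only a rough model of
what mm/page_alloc.c does (the helper name is made up, and the details
differ between kernel versions), but plugging in the numbers from this
zoneinfo reproduces both the default high and the 4 * batch floor that
raising vm.percpu_pagelist_high_fraction runs into, discussed further
below.]

/*
 * Simplified model of how the per-CPU pcp "high" limit is derived.
 * The helper name is invented and details differ across kernel
 * versions; the constants in main() are taken from the zoneinfo above.
 */
#include <stdio.h>

static unsigned long pcp_high_estimate(unsigned long managed_pages,
                                       unsigned long low_wmark,
                                       unsigned long nr_cpus,
                                       unsigned long high_fraction,
                                       unsigned long batch)
{
        /* default: spread the zone's low watermark across the CPUs */
        unsigned long total = high_fraction ? managed_pages / high_fraction
                                            : low_wmark;
        unsigned long high = total / nr_cpus;

        /* the high limit never drops below four times the batch size */
        return high > 4 * batch ? high : 4 * batch;
}

int main(void)
{
        unsigned long managed = 291974346, low = 1636587, cpus = 256, batch = 63;

        /* default: ~6392 pages, i.e. roughly 100x the batch size */
        printf("default high: %lu\n",
               pcp_high_estimate(managed, low, cpus, 0, batch));

        /* vm.percpu_pagelist_high_fraction = 0x7fffffff: clamped to 4 * batch */
        printf("high with fraction 0x7fffffff: %lu\n",
               pcp_high_estimate(managed, low, cpus, 0x7fffffff, batch));

        return 0;
}

In other words, by default the zone's low watermark is split across the
256 CPUs, which is what makes high roughly 100 times the batch size on
this machine.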
> >> >> >> >
> >> >> >> > The pcp high is around 100 times the batch size.
> >> >> >> >
> >> >> >> > I also traced the latency associated with the free_pcppages_bulk()
> >> >> >> > function during the container exit process:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 0        |
> >> >> >> >        256 -> 511        : 148      |*****************
> >> >> >> >        512 -> 1023       : 334      |****************************************
> >> >> >> >       1024 -> 2047       : 33       |***
> >> >> >> >       2048 -> 4095       : 5        |
> >> >> >> >       4096 -> 8191       : 7        |
> >> >> >> >       8192 -> 16383      : 12       |*
> >> >> >> >      16384 -> 32767      : 30       |***
> >> >> >> >      32768 -> 65535      : 21       |**
> >> >> >> >      65536 -> 131071     : 15       |*
> >> >> >> >     131072 -> 262143     : 27       |***
> >> >> >> >     262144 -> 524287     : 84       |**********
> >> >> >> >     524288 -> 1048575    : 203      |************************
> >> >> >> >    1048576 -> 2097151    : 284      |**********************************
> >> >> >> >    2097152 -> 4194303    : 327      |***************************************
> >> >> >> >    4194304 -> 8388607    : 215      |*************************
> >> >> >> >    8388608 -> 16777215   : 116      |*************
> >> >> >> >   16777216 -> 33554431   : 47       |*****
> >> >> >> >   33554432 -> 67108863   : 8        |
> >> >> >> >   67108864 -> 134217727  : 3        |
> >> >> >> >
> >> >> >> > The latency can reach tens of milliseconds.
> >> >> >> >
> >> >> >> > Experimenting
> >> >> >> > =============
> >> >> >> >
> >> >> >> > vm.percpu_pagelist_high_fraction
> >> >> >> > --------------------------------
> >> >> >> >
> >> >> >> > The kernel version currently deployed in our production environment is
> >> >> >> > the stable 6.1.y, and my initial strategy involves optimizing the
> >> >> >>
> >> >> >> IMHO, we should focus on upstream activity in the cover letter and patch
> >> >> >> description. And I don't think that it's necessary to describe the
> >> >> >> alternative solution in too much detail.
> >> >> >>
> >> >> >> > vm.percpu_pagelist_high_fraction parameter. By increasing the value of
> >> >> >> > vm.percpu_pagelist_high_fraction, I aim to diminish the batch size during
> >> >> >> > page draining, which subsequently leads to a substantial reduction in
> >> >> >> > latency. After setting the sysctl value to 0x7fffffff, I observed a
> >> >> >> > notable improvement in latency.
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 120      |
> >> >> >> >        256 -> 511        : 365      |*
> >> >> >> >        512 -> 1023       : 201      |
> >> >> >> >       1024 -> 2047       : 103      |
> >> >> >> >       2048 -> 4095       : 84       |
> >> >> >> >       4096 -> 8191       : 87       |
> >> >> >> >       8192 -> 16383      : 4777     |**************
> >> >> >> >      16384 -> 32767      : 10572    |*******************************
> >> >> >> >      32768 -> 65535      : 13544    |****************************************
> >> >> >> >      65536 -> 131071     : 12723    |*************************************
> >> >> >> >     131072 -> 262143     : 8604     |*************************
> >> >> >> >     262144 -> 524287     : 3659     |**********
> >> >> >> >     524288 -> 1048575    : 921      |**
> >> >> >> >    1048576 -> 2097151    : 122      |
> >> >> >> >    2097152 -> 4194303    : 5        |
> >> >> >> >
> >> >> >> > However, augmenting vm.percpu_pagelist_high_fraction can also decrease
> >> >> >> > the pcp high watermark size to a minimum of four times the batch size.
> >> >> >> > While this could theoretically affect throughput, as highlighted by
> >> >> >> > Ying[0], we have yet to observe any significant difference in throughput
> >> >> >> > within our production environment after implementing this change.
> >> >> >> >
> >> >> >> > Backporting the series "mm: PCP high auto-tuning"
> >> >> >> > -------------------------------------------------
> >> >> >>
> >> >> >> Again, not upstream activity.  We can describe the upstream behavior
> >> >> >> directly.
> >> >> >
> >> >> > Andrew has requested that I provide a more comprehensive analysis of
> >> >> > this issue, and in response, I have endeavored to outline all the
> >> >> > pertinent details in a thorough and detailed manner.
> >> >>
> >> >> IMHO, upstream activity can provide a comprehensive analysis of the issue
> >> >> too.  And your patch has changed a lot since the first version.  It's
> >> >> better to describe your current version.
> >> >
> >> > After backporting the pcp auto-tuning feature to the 6.1.y branch, the
> >> > code is almost the same as the upstream kernel wrt the pcp. I have
> >> > thoroughly documented the detailed data showcasing the changes in the
> >> > backported version, providing a clear picture of the results. However,
> >> > it's crucial to note that I am unable to directly run the upstream
> >> > kernel in our production environment due to practical constraints.
> >>
> >> IMHO, the patch is for the upstream kernel, not some downstream kernel, so
> >> the focus should be the upstream activity: the issue of the upstream
> >> kernel, and how to resolve it.  The production environment test results
> >> can be used to support the upstream change.
> >
> > The only distinction in the pcp between version 6.1.y and the
> > upstream kernel lies in the modifications you made to the code.
> > Furthermore, given that your code changes have now been successfully
> > backported, what else do you expect me to do?
>
> If you can run the upstream kernel directly with some proxy workloads,
> it will be better.  But I understand that this may not be easy for you.
>
> So, what I really expect you to do is to organize the patch description
> in an upstream-centric way: describe the issue of the upstream kernel
> and how you resolve it,
> although your test data comes from a downstream kernel with the same
> page allocator behavior.
>
> >> >> >> > My second endeavor was to backport the series titled
> >> >> >> > "mm: PCP high auto-tuning"[1], which comprises nine individual patches,
> >> >> >> > into our 6.1.y stable kernel. Subsequent to its deployment in our
> >> >> >> > production environment, I noted a pronounced reduction in latency. The
> >> >> >> > observed outcomes are enumerated below:
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |
> >> >> >> >          2 -> 3          : 0        |
> >> >> >> >          4 -> 7          : 0        |
> >> >> >> >          8 -> 15         : 0        |
> >> >> >> >         16 -> 31         : 0        |
> >> >> >> >         32 -> 63         : 0        |
> >> >> >> >         64 -> 127        : 0        |
> >> >> >> >        128 -> 255        : 0        |
> >> >> >> >        256 -> 511        : 0        |
> >> >> >> >        512 -> 1023       : 0        |
> >> >> >> >       1024 -> 2047       : 2        |
> >> >> >> >       2048 -> 4095       : 11       |
> >> >> >> >       4096 -> 8191       : 3        |
> >> >> >> >       8192 -> 16383      : 1        |
> >> >> >> >      16384 -> 32767      : 2        |
> >> >> >> >      32768 -> 65535      : 7        |
> >> >> >> >      65536 -> 131071     : 198      |*********
> >> >> >> >     131072 -> 262143     : 530      |************************
> >> >> >> >     262144 -> 524287     : 824      |**************************************
> >> >> >> >     524288 -> 1048575    : 852      |****************************************
> >> >> >> >    1048576 -> 2097151    : 714      |*********************************
> >> >> >> >    2097152 -> 4194303    : 389      |******************
> >> >> >> >    4194304 -> 8388607    : 143      |******
> >> >> >> >    8388608 -> 16777215   : 29       |*
> >> >> >> >   16777216 -> 33554431   : 1        |
> >> >> >> >
> >> >> >> > Compared to the previous data, the maximum latency has been reduced to
> >> >> >> > less than 30ms.
> >> >> >>
> >> >> >> People don't care too much about page freeing latency while processes are
> >> >> >> exiting.  Instead, they care more about the process exiting time, that
> >> >> >> is, throughput.  So, it's better to show the page allocation latency
> >> >> >> which is affected by the simultaneously exiting processes.
> >> >> >
> >> >> > I'm confused too. Is this issue really hard to understand?
> >> >>
> >> >> IMHO, it's better to prove the issue directly.  If you cannot prove it
> >> >> directly, you can try an alternative and describe why.
> >> >
> >> > Not all data can be verified straightforwardly or effortlessly. The
> >> > primary focus is the zone->lock contention, which necessitates
> >> > measuring the latency it incurs. To accomplish this, the
> >> > free_pcppages_bulk() function serves as an effective tool for
> >> > evaluation. Therefore, I have opted to specifically measure the
> >> > latency associated with free_pcppages_bulk().
> >> >
> >> > The reason for not measuring allocation latency is that it requires
> >> > finding a willing participant to endure the potential delays, a search
> >> > that proved unsuccessful as no one expressed interest. In contrast,
> >> > assessing the latency of free_pcppages_bulk() only requires identifying
> >> > and experimenting with the source causing the delays, making it a more
> >> > feasible approach.
> >>
> >> Can you run a benchmark program that does quite some memory allocation
> >> by yourself to test it?
> >
> > I can have a try.
>
> Thanks!
>
> > However, is it the key point here?
>
> It's better to prove the issue directly instead of indirectly.
>
> > Why can't the lock contention be measured by the freeing?
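[On the benchmark question above: a sketch along the following lines would
be a starting point. It simply faults in anonymous memory page by page and
reports the slowest fault per pass, which is where a wait on zone->lock
during the container exits would show up. The 1 GiB region size and the
endless loop are arbitrary choices for illustration, not an agreed-upon
test.]

/*
 * Minimal allocation-latency probe: repeatedly fault in anonymous
 * memory one page at a time and report the slowest page fault seen
 * in each pass. Run it while the containers are exiting; stop with
 * Ctrl-C. Illustrative sketch only, not a polished benchmark.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define REGION    (1UL << 30)    /* fault in 1 GiB per pass */

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
        for (;;) {
                char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                uint64_t worst = 0;

                if (buf == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                for (unsigned long off = 0; off < REGION; off += PAGE_SZ) {
                        uint64_t t0 = now_ns();
                        uint64_t d;

                        buf[off] = 1;            /* trigger the page fault */
                        d = now_ns() - t0;
                        if (d > worst)
                                worst = d;
                }

                printf("worst page-fault latency this pass: %llu ns\n",
                       (unsigned long long)worst);
                munmap(buf, REGION);
        }
}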
>
> Have you measured the lock contention after adjusting
> CONFIG_PCP_BATCH_SCALE_MAX?  IIUC, the lock contention will become even
> worse.  A smaller CONFIG_PCP_BATCH_SCALE_MAX helps latency, but it will
> hurt lock contention.  I have said it several times, but it seems that
> you don't agree with me.  Can you prove I'm wrong with data?

Now I understand the point. It seems we have different understandings of
the zone->lock contention.

  CPU A (Freer)                      CPU B (Allocator)
  lock zone->lock
  free pages                         lock zone->lock
  unlock zone->lock
                                     alloc pages
                                     unlock zone->lock

If the Freer holds the zone->lock for an extended period, the Allocator
has to wait, right? Isn't that a lock contention issue? Lock contention
affects not only CPU system usage but also latency.

-- 
Regards
Yafang
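To put rough numbers behind this trade-off, the toy program below
(userspace only; the mutex, the per-page work, and the chunk sizes are
stand-ins rather than kernel code) frees a fixed number of "pages" under
one lock in chunks of different sizes, analogous to capping each
zone->lock hold at batch << pcp_batch_scale_max. It reports the worst
single hold, which bounds how long a concurrent allocator could be kept
waiting, together with the number of lock acquisitions: a smaller chunk
shortens the worst hold but takes the lock more often, which is exactly
the tension between the two views above. Build with: cc -O2 -pthread.

/*
 * Userspace toy model of draining a pcp list: free 'total' pages while
 * holding a lock, either in one long critical section or in smaller
 * chunks. No allocator thread is running; the point is only to compare
 * the longest single lock hold against the number of acquisitions.
 * Names and numbers are illustrative, not taken from the kernel.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* Stand-in for freeing one page: a little work done under the lock. */
static void free_one_page(volatile uint64_t *sink)
{
        for (int i = 0; i < 200; i++)
                *sink += i;
}

static void drain(int total, int chunk)
{
        volatile uint64_t sink = 0;
        uint64_t worst_hold = 0;
        int acquisitions = 0;

        for (int done = 0; done < total; ) {
                int n = (total - done < chunk) ? total - done : chunk;
                uint64_t t0, hold;

                pthread_mutex_lock(&zone_lock);
                t0 = now_ns();
                for (int i = 0; i < n; i++)
                        free_one_page(&sink);
                hold = now_ns() - t0;
                pthread_mutex_unlock(&zone_lock);

                if (hold > worst_hold)
                        worst_hold = hold;
                acquisitions++;
                done += n;
        }

        printf("chunk=%5d  acquisitions=%4d  worst hold=%9llu ns\n",
               chunk, acquisitions, (unsigned long long)worst_hold);
}

int main(void)
{
        int total = 6392;        /* roughly the pcp "high" from the zoneinfo above */

        drain(total, total);     /* drain everything in one hold */
        drain(total, 63 * 32);   /* ~ batch << 5, the current default scale */
        drain(total, 63);        /* batch only: short holds, many acquisitions */

        return 0;
}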