From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yafang Shao <laoar.shao@gmail.com>
Date: Tue, 2 Jul 2024 14:37:38 +0800
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
To: Andrew Morton
Cc: linux-mm@kvack.org, Matthew Wilcox, David Rientjes, "Huang, Ying", Mel Gorman
In-Reply-To: <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>
> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>
> > Currently, we're encountering latency spikes in our container environment
> > when a specific container with multiple Python-based tasks exits. These
> > tasks may hold the zone->lock for an extended period, significantly
> > impacting latency for other containers attempting to allocate memory.
>
> Is this locking issue well understood?  Is anyone working on it?  A
> reasonably detailed description of the issue and a description of any
> ongoing work would be helpful here.

In our containerized environment, we have a specific type of container
that runs 18 processes, each consuming approximately 6GB of RSS. The
workload is organized as separate processes rather than threads because
the Python Global Interpreter Lock (GIL) would be a bottleneck in a
multi-threaded setup.

When these containers exit, other containers hosted on the same machine
experience significant latency spikes. Perf tracing shows that the root
cause is the exiting processes all executing exit_mmap() at the same
time. The concurrent access to the zone->lock makes that lock a
contention hotspot and degrades performance for everything else on the
machine. The perf results below show this contention as the primary
contributor to the observed latency:

  +   77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
  -   76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
     - 76.97% exit_mmap
        - 58.58% unmap_vmas
           - 58.55% unmap_single_vma
              - unmap_page_range
                 - 58.32% zap_pte_range
                    - 42.88% tlb_flush_mmu
                       - 42.76% free_pages_and_swap_cache
                          - 41.22% release_pages
                             - 33.29% free_unref_page_list
                                - 32.37% free_unref_page_commit
                                   - 31.64% free_pcppages_bulk
                                      + 28.65% _raw_spin_lock
                                        1.28% __list_del_entry_valid
                             + 3.25% folio_lruvec_lock_irqsave
                             + 0.75% __mem_cgroup_uncharge_list
                               0.60% __mod_lruvec_state
                            1.07% free_swap_cache
                    + 11.69% page_remove_rmap
                      0.64% __mod_lruvec_page_state
        - 17.34% remove_vma
           - 17.25% vm_area_free
              - 17.23% kmem_cache_free
                 - 17.15% __slab_free
                    - 14.56% discard_slab
                         free_slab
                         __free_slab
                         __free_pages
                       - free_unref_page
                          - 13.50% free_unref_page_commit
                             - free_pcppages_bulk
                                + 13.44% _raw_spin_lock

By enabling the mm_page_pcpu_drain tracepoint we can see the detailed
call stack:

  <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
      page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
  <...>-1540432 [224] d..3. 618048.023887:
   => free_pcppages_bulk
   => free_unref_page_commit
   => free_unref_page_list
   => release_pages
   => free_pages_and_swap_cache
   => tlb_flush_mmu
   => zap_pte_range
   => unmap_page_range
   => unmap_single_vma
   => unmap_vmas
   => exit_mmap
   => mmput
   => do_exit
   => do_group_exit
   => get_signal
   => arch_do_signal_or_restart
   => exit_to_user_mode_prepare
   => syscall_exit_to_user_mode
   => do_syscall_64
   => entry_SYSCALL_64_after_hwframe

The servers experiencing these issues are large machines: 256 CPUs and
1TB of memory, all within a single NUMA node.
The zoneinfo is as follows:

  Node 0, zone   Normal
    pages free     144465775
          boost    0
          min      1309270
          low      1636587
          high     1963904
          spanned  564133888
          present  296747008
          managed  291974346
          cma      0
          protection: (0, 0, 0, 0)
    ...
    pagesets
      cpu: 0
                count: 2217
                high:  6392
                batch: 63
      vm stats threshold: 125
      cpu: 1
                count: 4510
                high:  6392
                batch: 63
      vm stats threshold: 125
      cpu: 2
                count: 3059
                high:  6392
                batch: 63
    ...

The pcp high is around 100 times the batch size (6392 / 63 is roughly 101).

We also traced the latency of the free_pcppages_bulk() function while one
of these containers was exiting:

  19:48:54
       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 0        |                                        |
         256 -> 511        : 148      |*****************                       |
         512 -> 1023       : 334      |****************************************|
        1024 -> 2047       : 33       |***                                     |
        2048 -> 4095       : 5        |                                        |
        4096 -> 8191       : 7        |                                        |
        8192 -> 16383      : 12       |*                                       |
       16384 -> 32767      : 30       |***                                     |
       32768 -> 65535      : 21       |**                                      |
       65536 -> 131071     : 15       |*                                       |
      131072 -> 262143     : 27       |***                                     |
      262144 -> 524287     : 84       |**********                              |
      524288 -> 1048575    : 203      |************************                |
     1048576 -> 2097151    : 284      |**********************************      |
     2097152 -> 4194303    : 327      |***************************************  |
     4194304 -> 8388607    : 215      |*************************               |
     8388608 -> 16777215   : 116      |*************                           |
    16777216 -> 33554431   : 47       |*****                                   |
    33554432 -> 67108863   : 8        |                                        |
    67108864 -> 134217727  : 3        |                                        |

  avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920

The latency can reach tens of milliseconds.

By adjusting vm.percpu_pagelist_high_fraction so that the pagelist high is
clamped to its minimum of 4 times the batch size, we were able to
significantly reduce the free_pcppages_bulk() latency during container
exits:

       nsecs               : count     distribution
           0 -> 1          : 0        |                                        |
           2 -> 3          : 0        |                                        |
           4 -> 7          : 0        |                                        |
           8 -> 15         : 0        |                                        |
          16 -> 31         : 0        |                                        |
          32 -> 63         : 0        |                                        |
          64 -> 127        : 0        |                                        |
         128 -> 255        : 120      |                                        |
         256 -> 511        : 365      |*                                       |
         512 -> 1023       : 201      |                                        |
        1024 -> 2047       : 103      |                                        |
        2048 -> 4095       : 84       |                                        |
        4096 -> 8191       : 87       |                                        |
        8192 -> 16383      : 4777     |**************                          |
       16384 -> 32767      : 10572    |*******************************         |
       32768 -> 65535      : 13544    |****************************************|
       65536 -> 131071     : 12723    |*************************************   |
      131072 -> 262143     : 8604     |*************************               |
      262144 -> 524287     : 3659     |**********                              |
      524288 -> 1048575    : 921      |**                                      |
     1048576 -> 2097151    : 122      |                                        |
     2097152 -> 4194303    : 5        |                                        |

  avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925

After tuning vm.percpu_pagelist_high_fraction to keep the pagelist high at
this minimum and confirming that it mitigated the latency issue, the other
containers stopped reporting similar complaints. We have therefore adopted
this tuning as a permanent workaround and deployed it across all server
clusters where these containers may be scheduled.
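For reference, the numbers above can be sanity-checked with the small
userspace sketch below. It is only a simplified approximation of the pcp
"high" calculation done by zone_highsize() in mm/page_alloc.c, under the
assumption that the default high is derived from the zone low watermark
split across the CPUs local to the zone; the constants come from the
zoneinfo quoted earlier and the fraction value 100000 is purely
hypothetical.

/*
 * Simplified userspace approximation of the pcp "high" calculation;
 * not the kernel implementation. Inputs are the zoneinfo numbers above.
 */
#include <stdio.h>

int main(void)
{
        unsigned long low_wmark = 1636587;   /* zone "low" watermark, in pages */
        unsigned long managed   = 291974346; /* managed pages in the zone */
        unsigned long cpus      = 256;       /* CPUs in the single NUMA node */
        unsigned long batch     = 63;        /* pcp batch size */
        unsigned long min_high  = 4 * batch; /* the 4*batch floor */

        /* Default (fraction == 0): high is based on the zone low watermark,
         * split across the CPUs local to the zone. */
        unsigned long high_default = low_wmark / cpus;

        /* With a sufficiently large percpu_pagelist_high_fraction (100000 is
         * only an example), the per-CPU share drops below 4*batch and is
         * clamped to it; the proposed '-1' setting would request this floor
         * directly. */
        unsigned long fraction  = 100000;
        unsigned long high_frac = managed / fraction / cpus;
        unsigned long high      = high_frac > min_high ? high_frac : min_high;

        printf("default high: %lu pages (~%lu x batch)\n",
               high_default, high_default / batch);
        printf("clamped high: %lu pages (floor = %lu)\n", high, min_high);
        return 0;
}

This prints a default high of 6392 pages (about 101 times the batch size),
matching the zoneinfo above, versus 252 pages once clamped to 4*batch.
Since the per-CPU lists are drained back to the buddy allocator under
zone->lock, bounding how many pages they can accumulate is consistent with
the latency drop seen in the histograms above.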
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -856,6 +856,10 @@ on per-cpu page lists. This entry only changes the value of hot per-cpu
> > page lists. A user can specify a number like 100 to allocate 1/100th of
> > each zone between per-cpu lists.
> >
> > +The minimum number of pages that can be stored in per-CPU page lists is
> > +four times the batch value. By writing '-1' to this sysctl, you can set
> > +this minimum value.
>
> I suggest we also describe why an operator would want to set this, and
> the expected effects of that action.

will improve it.

> > The batch value of each per-cpu page list remains the same regardless of
> > the value of the high fraction so allocation latencies are unaffected.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e22ce5675ca..e7313f9d704b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5486,6 +5486,10 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online,
> >       int nr_split_cpus;
> >       unsigned long total_pages;
> >
> > +     /* Setting -1 to set the minimum pagelist size, four times the batch size */
>
> Some old-timers still use 80-column xterms ;)

will change it.

Regards
Yafang