From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Yafang Shao
Cc: Andrew Morton, linux-mm@kvack.org, Matthew Wilcox, David Rientjes, Mel Gorman
Subject: Re: [PATCH] mm: Enable setting -1 for vm.percpu_pagelist_high_fraction to set the minimum pagelist
In-Reply-To: (Yafang Shao's message of "Tue, 2 Jul 2024 14:37:38 +0800")
References: <20240701142046.6050-1-laoar.shao@gmail.com> <20240701195143.7e8d597abc14b255f3bc4bcd@linux-foundation.org>
Date: Tue, 02 Jul 2024 17:08:42 +0800
Message-ID: <874j98noth.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Yafang Shao writes:

> On Tue, Jul 2, 2024 at 10:51 AM Andrew Morton wrote:
>>
>> On Mon, 1 Jul 2024 22:20:46 +0800 Yafang Shao wrote:
>>
>> > Currently, we're encountering latency spikes in our container environment
>> > when a specific container with multiple Python-based tasks exits. These
>> > tasks may hold the zone->lock for an extended period, significantly
>> > impacting latency for other containers attempting to allocate memory.
>>
>> Is this locking issue well understood?  Is anyone working on it?  A
>> reasonably detailed description of the issue and a description of any
>> ongoing work would be helpful here.
>
> In our containerized environment, we have a specific type of container
> that runs 18 processes, each consuming approximately 6GB of RSS. These
> processes are organized as separate processes rather than threads due
> to the Python Global Interpreter Lock (GIL) being a bottleneck in a
> multi-threaded setup. Upon the exit of these containers, other
> containers hosted on the same machine experience significant latency
> spikes.
>
> Our investigation using perf tracing revealed that the root cause of
> these spikes is the simultaneous execution of exit_mmap() by each of
> the exiting processes. This concurrent access to the zone->lock
> results in contention, which becomes a hotspot and negatively impacts
> performance. The perf results clearly indicate this contention as a
> primary contributor to the observed latency issues.
>
>   + 77.02%  0.00%  uwsgi  [kernel.kallsyms]  [k] mmput
>   - 76.98%  0.01%  uwsgi  [kernel.kallsyms]  [k] exit_mmap
>      - 76.97% exit_mmap
>         - 58.58% unmap_vmas
>            - 58.55% unmap_single_vma
>               - unmap_page_range
>                  - 58.32% zap_pte_range
>                     - 42.88% tlb_flush_mmu
>                        - 42.76% free_pages_and_swap_cache
>                           - 41.22% release_pages
>                              - 33.29% free_unref_page_list
>                                 - 32.37% free_unref_page_commit
>                                    - 31.64% free_pcppages_bulk
>                                       + 28.65% _raw_spin_lock
>                                         1.28% __list_del_entry_valid
>                              + 3.25% folio_lruvec_lock_irqsave
>                              + 0.75% __mem_cgroup_uncharge_list
>                                0.60% __mod_lruvec_state
>                             1.07% free_swap_cache
>                     + 11.69% page_remove_rmap
>                       0.64% __mod_lruvec_page_state
>         - 17.34% remove_vma
>            - 17.25% vm_area_free
>               - 17.23% kmem_cache_free
>                  - 17.15% __slab_free
>                     - 14.56% discard_slab
>                          free_slab
>                          __free_slab
>                          __free_pages
>                        - free_unref_page
>                           - 13.50% free_unref_page_commit
>                              - free_pcppages_bulk
>                                 + 13.44% _raw_spin_lock
>
> By enabling the mm_page_pcpu_drain() tracepoint we can find the detailed
> stack:
>
> <...>-1540432 [224] d..3. 618048.023883: mm_page_pcpu_drain:
>   page=0000000035a1b0b7 pfn=0x11c19c72 order=0 migratetype=1
> <...>-1540432 [224] d..3. 618048.023887:
> => free_pcppages_bulk
> => free_unref_page_commit
> => free_unref_page_list
> => release_pages
> => free_pages_and_swap_cache
> => tlb_flush_mmu
> => zap_pte_range
> => unmap_page_range
> => unmap_single_vma
> => unmap_vmas
> => exit_mmap
> => mmput
> => do_exit
> => do_group_exit
> => get_signal
> => arch_do_signal_or_restart
> => exit_to_user_mode_prepare
> => syscall_exit_to_user_mode
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
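
As a side note for anyone following the stacks: what the profile and the
trace above show is each exiting task draining its full per-CPU page list
back to the buddy allocator while holding zone->lock.  The userspace mock
below is only meant to illustrate that shape of contention; the names, the
pthread mutex standing in for zone->lock, and the sizes (taken from the
pagesets data quoted further down) are assumptions for illustration, not
kernel code.

/*
 * Simplified userspace mock of the contention pattern above: several
 * "exiting tasks" each return their pages to one shared pool, and every
 * drain happens inside a single hold of a shared lock.  Illustrative
 * only; a pthread mutex stands in for zone->lock and an increment
 * stands in for freeing one page to the buddy allocator.
 */
#include <pthread.h>
#include <stdio.h>

#define NPROC      18           /* processes exiting together, as described */
#define PAGES_EACH (6L << 18)   /* ~6GB of 4KiB pages per process */
#define PCP_HIGH   6392         /* pages drained per lock hold in this mock */

static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;
static long zone_free_pages;

/* One bulk drain: all of 'count' pages are freed inside one lock hold. */
static void free_pages_bulk(long count)
{
        pthread_mutex_lock(&zone_lock);
        for (long i = 0; i < count; i++)
                zone_free_pages++;
        pthread_mutex_unlock(&zone_lock);
}

static void *exiting_task(void *arg)
{
        (void)arg;
        for (long done = 0; done < PAGES_EACH; done += PCP_HIGH) {
                long n = PAGES_EACH - done;

                free_pages_bulk(n > PCP_HIGH ? PCP_HIGH : n);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NPROC];

        for (int i = 0; i < NPROC; i++)
                pthread_create(&tid[i], NULL, exiting_task, NULL);
        for (int i = 0; i < NPROC; i++)
                pthread_join(tid[i], NULL);
        printf("freed %ld pages total\n", zone_free_pages);
        return 0;
}

The longer each hold is (i.e. the more pages drained per flush), the longer
every other task allocating or freeing pages has to wait, which is the
latency the other containers observe.
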
> The servers experiencing these issues are equipped with impressive
> hardware specifications, including 256 CPUs and 1TB of memory, all
> within a single NUMA node. The zoneinfo is as follows,
>
> Node 0, zone   Normal
>   pages free     144465775
>         boost    0
>         min      1309270
>         low      1636587
>         high     1963904
>         spanned  564133888
>         present  296747008
>         managed  291974346
>         cma      0
>         protection: (0, 0, 0, 0)
> ...
> ...
>   pagesets
>     cpu: 0
>               count: 2217
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 1
>               count: 4510
>               high:  6392
>               batch: 63
>   vm stats threshold: 125
>     cpu: 2
>               count: 3059
>               high:  6392
>               batch: 63
>
> ...
>
> The high is around 100 times the batch size.
>
> We also traced the latency associated with the free_pcppages_bulk()
> function during the container exit process:
>
> 19:48:54
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 148      |*****************                       |
>        512 -> 1023       : 334      |****************************************|
>       1024 -> 2047       : 33       |***                                     |
>       2048 -> 4095       : 5        |                                        |
>       4096 -> 8191       : 7        |                                        |
>       8192 -> 16383      : 12       |*                                       |
>      16384 -> 32767      : 30       |***                                     |
>      32768 -> 65535      : 21       |**                                      |
>      65536 -> 131071     : 15       |*                                       |
>     131072 -> 262143     : 27       |***                                     |
>     262144 -> 524287     : 84       |**********                              |
>     524288 -> 1048575    : 203      |************************                |
>    1048576 -> 2097151    : 284      |**********************************      |
>    2097152 -> 4194303    : 327      |***************************************  |
>    4194304 -> 8388607    : 215      |*************************               |
>    8388608 -> 16777215   : 116      |*************                           |
>   16777216 -> 33554431   : 47       |*****                                   |
>   33554432 -> 67108863   : 8        |                                        |
>   67108864 -> 134217727  : 3        |                                        |
>
> avg = 3066311 nsecs, total: 5887317501 nsecs, count: 1920
>
> The latency can reach tens of milliseconds.
>
> By adjusting the vm.percpu_pagelist_high_fraction parameter to set the
> minimum pagelist high at 4 times the batch size, we were able to
> significantly reduce the latency associated with the
> free_pcppages_bulk() function during container exits:
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 120      |                                        |
>        256 -> 511        : 365      |*                                       |
>        512 -> 1023       : 201      |                                        |
>       1024 -> 2047       : 103      |                                        |
>       2048 -> 4095       : 84       |                                        |
>       4096 -> 8191       : 87       |                                        |
>       8192 -> 16383      : 4777     |**************                          |
>      16384 -> 32767      : 10572    |*******************************         |
>      32768 -> 65535      : 13544    |****************************************|
>      65536 -> 131071     : 12723    |*************************************   |
>     131072 -> 262143     : 8604     |*************************               |
>     262144 -> 524287     : 3659     |**********                              |
>     524288 -> 1048575    : 921      |**                                      |
>    1048576 -> 2097151    : 122      |                                        |
>    2097152 -> 4194303    : 5        |                                        |
>
> avg = 103814 nsecs, total: 5805802787 nsecs, count: 55925
>
> After successfully tuning the vm.percpu_pagelist_high_fraction sysctl
> knob to set the minimum pagelist high at a level that effectively
> mitigated latency issues, we observed that other containers were no
> longer experiencing similar complaints. As a result, we decided to
> implement this tuning as a permanent workaround and have deployed it
> across all clusters of servers where these containers may be deployed.

Thanks for your detailed data.

IIUC, the latency of free_pcppages_bulk() during process exit shouldn't
be a problem by itself, because users care more about the total time of
process exit, that is, throughput.  And I suspect that zone->lock
contention and page allocation/freeing throughput will be worse with
your configuration.

But the latency of free_pcppages_bulk() and of page allocation in other
processes is a problem, and your configuration can help with that.

Another choice is to change CONFIG_PCP_BATCH_SCALE_MAX.  That way, you
keep a normal PCP size (high) but a smaller PCP batch.  I guess that may
help both latency and throughput on your system.  Could you give it a
try?
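
To put numbers on the two knobs being discussed, here is a
back-of-the-envelope model built only from the zoneinfo and pagesets
quoted above.  The formulas follow my reading of zone_highsize() and of
the PCP batch scaling (CONFIG_PCP_BATCH_SCALE_MAX defaults to 5, if I
read the Kconfig correctly), so treat it as an illustration rather than
the kernel's exact logic.

/*
 * Rough model of the per-CPU pagelist sizes, using the figures quoted
 * above.  An approximation of my reading of zone_highsize() and the PCP
 * batch scaling; not the kernel source.
 */
#include <stdio.h>

int main(void)
{
        long managed   = 291974346;  /* managed pages, Node 0 zone Normal */
        long low_wmark = 1636587;    /* low watermark, Node 0 zone Normal */
        long cpus      = 256;        /* online CPUs in the node */
        long batch     = 63;         /* pcp batch from the pagesets above */
        long scale_max = 5;          /* assumed CONFIG_PCP_BATCH_SCALE_MAX default */

        /* Default: high is derived from the low watermark split across CPUs. */
        long high_default = low_wmark / cpus;

        /* The floor the proposed "-1" setting aims for: 4 * batch. */
        long high_min = 4 * batch;

        /* Fraction one would need today (rounded up) to reach that floor. */
        long fraction_needed = (managed + high_min * cpus - 1) / (high_min * cpus);

        printf("default high         : %ld pages (~%ldx batch)\n",
               high_default, high_default / batch);
        printf("minimum high (4*b)   : %ld pages, needs fraction >= ~%ld\n",
               high_min, fraction_needed);
        /* Alternative: keep high as-is, but cap the pages freed per
         * zone->lock hold by lowering the batch scale. */
        printf("pages freed per hold : <= %ld with scale %ld, <= %ld with scale 0\n",
               batch << scale_max, scale_max, batch);
        return 0;
}

Either knob reduces how much work is done per zone->lock hold; the
difference is whether you shrink the per-CPU cache itself (high) or only
the amount drained per flush (batch).
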
[snip]

--
Best Regards,
Huang, Ying