From: Barry Song <21cnbao@gmail.com>
Date: Wed, 4 Sep 2024 21:16:19 +1200
Subject: Re: [PATCH v3 2/2] mm: tlb: add tlb swap entries batch async release
To: Zhiguo Jiang
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Will Deacon, "Aneesh Kumar K.V", Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, linux-arch@vger.kernel.org,
	cgroups@vger.kernel.org, David Hildenbrand, opensource.kernel@vivo.com
In-Reply-To: <20240805153639.1057-3-justinjiang@vivo.com>
References: <20240805153639.1057-1-justinjiang@vivo.com> <20240805153639.1057-3-justinjiang@vivo.com>

On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang wrote:
>
> One of the main reasons for the prolonged exit of a process with an
> independent mm is the time-consuming release of its swap entries.
> The proportion of swap memory occupied by the process increases over
> time, because high memory pressure triggers reclaim of anonymous folios
> into swap space; e.g., on Android devices we found this proportion can
> reach 60% or more after a period of time. Additionally, the relatively
> lengthy path for releasing swap entries further contributes to the
> time required to release them.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After booting up, start 15 processes first, and then observe the
> physical memory size occupied by the last-launched process at
> different points in time.
> Example: the process launched last is com.qiyi.video.
> | memory type   | 0min   | 1min   | 5min   | 10min  | 15min  |
> --------------------------------------------------------------
> | VmRSS(KB)     | 453832 | 252300 | 204364 | 199944 | 199748 |
> | RssAnon(KB)   | 247348 | 99296  | 71268  | 67808  | 67660  |
> | RssFile(KB)   | 205536 | 152020 | 132144 | 131184 | 131136 |
> | RssShmem(KB)  | 1048   | 984    | 952    | 952    | 952    |
> | VmSwap(KB)    | 202692 | 334852 | 362880 | 366340 | 366488 |
> | Swap ratio(%) | 30.87% | 57.03% | 63.97% | 64.69% | 64.72% |
> Note: min - minute.
>
> When there are multiple processes with independent mm and high memory
> pressure in the system, launching a process with a large memory
> requirement is likely to trigger the near-instantaneous killing of
> many processes with independent mm. Because the exiting processes
> occupy multiple CPU cores for concurrent execution, important
> non-exiting processes may start to lag.
>
> To solve this problem, we have introduced an asynchronous swap entry
> release mechanism for multiple exiting processes: the swap entries
> occupied by the exiting processes are isolated, cached and handed over
> to an asynchronous kworker to complete the release. This allows the
> exiting processes to finish quickly and release their CPU resources.
> We have validated this modification on Android products and achieved
> the expected benefits.
>
> Testing Platform: 8GB RAM
> Testing procedure:
> After restarting the machine, start 15 app processes first, then start
> the camera app process and monitor its cold start and preview times.
>
> Test data for camera process cold start time (unit: milliseconds):
> | seq    | 1    | 2    | 3    | 4    | 5    | 6    | average |
> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 | 1512    |
> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 | 1204    |
>
> Test data for camera process preview time (unit: milliseconds):
> | seq    | 1    | 2    | 3    | 4    | 5    | 6    | average |
> | before | 267  | 402  | 504  | 513  | 161  | 265  | 352     |
> | after  | 188  | 223  | 301  | 203  | 162  | 154  | 205     |
>
> Based on the averages of the six test runs above, the benefits of this
> patch are:
> 1. The cold start time of the camera app process is reduced by about 20%.
> 2. The preview time of the camera app process is reduced by about 42%.
>
> It offers several benefits:
> 1. Alleviates the high system CPU load caused by multiple exiting
>    processes running simultaneously.
> 2. Reduces lock contention on the swap entry free path by using one
>    asynchronous kworker instead of multiple exiting processes running
>    in parallel.
> 3. Releases the pte_present memory occupied by exiting processes more
>    efficiently.
>
> Signed-off-by: Zhiguo Jiang
> ---
>  arch/s390/include/asm/tlb.h |   8 +
>  include/asm-generic/tlb.h   |  44 ++++++
>  include/linux/mm_types.h    |  58 +++++++
>  mm/memory.c                 |   3 +-
>  mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++++++
>  5 files changed, 408 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index e95b2c8081eb..3f681f63390f
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
>  		struct page *page, bool delay_rmap, int page_size);
>  static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>  		struct page *page, unsigned int nr_pages, bool delay_rmap);
> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +		swp_entry_t entry, int nr);
>
>  #define tlb_flush tlb_flush
>  #define pte_free_tlb pte_free_tlb
> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
>  	return false;
>  }
>
> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +		swp_entry_t entry, int nr)
> +{
> +	return false;
> +}
> +
>  static inline void tlb_flush(struct mmu_gather *tlb)
>  {
>  	__tlb_flush_mm_lazy(tlb->mm);
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 709830274b75..8b4d516b35b8
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  static inline void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma) { }
>  #endif
>
> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> +struct mmu_swap_batch {
> +	struct mmu_swap_batch *next;
> +	unsigned int nr;
> +	unsigned int max;
> +	encoded_swpentry_t encoded_entrys[];
> +};
> +
> +#define MAX_SWAP_GATHER_BATCH	\
> +	((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void *))
> +
> +#define MAX_SWAP_GATHER_BATCH_COUNT	(10000UL / MAX_SWAP_GATHER_BATCH)
> +
> +struct mmu_swap_gather {
> +	/*
> +	 * the asynchronous kworker to batch
> +	 * release swap entries
> +	 */
> +	struct work_struct free_work;
> +
> +	/* batch cache swap entries */
> +	unsigned int batch_count;
> +	struct mmu_swap_batch *active;
> +	struct mmu_swap_batch local;
> +	encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE];
> +};
> +
> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +		swp_entry_t entry, int nr);
> +#endif
> +
>  /*
>   * struct mmu_gather is an opaque type used by the mm code for passing around
>   * any data needed by arch specific code for tlb_remove_page.
> @@ -343,6 +374,18 @@ struct mmu_gather {
>  	unsigned int vma_exec : 1;
>  	unsigned int vma_huge : 1;
>  	unsigned int vma_pfn  : 1;
> +#ifndef CONFIG_MMU_GATHER_NO_GATHER
> +	/*
> +	 * Two states of releasing swap entries
> +	 * asynchronously:
> +	 * swp_freeable - may release asynchronously in the future
> +	 * swp_freeing  - is releasing asynchronously.
> +	 */
> +	unsigned int swp_freeable : 1;
> +	unsigned int swp_freeing : 1;
> +	unsigned int swp_disable : 1;
> +#endif
>
>  	unsigned int batch_count;
>
> @@ -354,6 +397,7 @@ struct mmu_gather {
>  #ifdef CONFIG_MMU_GATHER_PAGE_SIZE
>  	unsigned int page_size;
>  #endif
> +	struct mmu_swap_gather *swp;
>  #endif
>  };
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 165c58b12ccc..2f66303f1519
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -283,6 +283,64 @@ typedef struct {
>  	unsigned long val;
>  } swp_entry_t;
>
> +/*
> + * encoded_swpentry_t - a type marking the encoded swp_entry_t.
> + *
> + * An 'encoded_swpentry_t' represents a 'swp_entry_t' with its highest
> + * bit indicating extra context-dependent information. Only used in the
> + * swp_entry asynchronous release path by mmu_swap_gather.
> + */
> +typedef struct {
> +	unsigned long val;
> +} encoded_swpentry_t;
> +
> +/*
> + * If this bit is set, the next item in an encoded_swpentry_t array is the
> + * "nr" argument, specifying the total number of consecutive swap entries
> + * associated with the same folio. If this bit is not set, "nr" is
> + * implicitly 1.
> + *
> + * Refer to include/asm/pgtable.h, swp_offset bits: 0 ~ 57, swp_type bits: 58 ~ 62.
> + * Bit 63 can be used here.
> + */
> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT	(1UL << (BITS_PER_LONG - 1))
> +
> +static __always_inline encoded_swpentry_t
> +encode_swpentry(swp_entry_t entry, unsigned long flags)
> +{
> +	encoded_swpentry_t ret;
> +
> +	VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> +	ret.val = flags | entry.val;
> +	return ret;
> +}
> +
> +static inline unsigned long encoded_swpentry_flags(encoded_swpentry_t entry)
> +{
> +	return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> +}
> +
> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_t entry)
> +{
> +	swp_entry_t ret;
> +
> +	ret.val = ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val;
> +	return ret;
> +}
> +
> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(unsigned long nr)
> +{
> +	encoded_swpentry_t ret;
> +
> +	VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT);
> +	ret.val = nr;
> +	return ret;
> +}
> +
> +static __always_inline unsigned long encoded_nr_swpentrys(encoded_swpentry_t entry)
> +{
> +	return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val);
> +}
> +
>  /**
>   * struct folio - Represents a contiguous set of bytes.
>   * @flags: Identical to the page flags.
> diff --git a/mm/memory.c b/mm/memory.c
> index d6a9dcddaca4..023a8adcb67c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  			if (!should_zap_cows(details))
>  				continue;
>  			rss[MM_SWAPENTS] -= nr;
> -			free_swap_and_cache_nr(entry, nr);
> +			if (!__tlb_remove_swap_entries(tlb, entry, nr))
> +				free_swap_and_cache_nr(entry, nr);
>  		} else if (is_migration_entry(entry)) {
>  			folio = pfn_swap_entry_folio(entry);
>  			if (!should_zap_folio(details, folio))
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 99b3e9408aa0..33dc9d1faff9
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -9,11 +9,303 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
> +/*
> + * The swp_entry asynchronous release mechanism for multiple processes with
> + * independent mm exiting simultaneously.
> + *
> + * While multiple exiting processes release their own mm simultaneously,
> + * the swap entries of the exiting processes are handled by isolating,
> + * caching and handing them over to an asynchronous kworker to complete
> + * the release.
> + *
> + * The conditions for an exiting process to enter the swp_entry asynchronous
> + * release path:
> + * 1. The exiting process's MM_SWAPENTS count is >= SWAP_CLUSTER_MAX, to avoid
> + *    allocating struct mmu_swap_gather frequently.
> + * 2. The number of exiting processes is >= NR_MIN_EXITING_PROCESSES.

Hi Zhiguo,

I'm curious about the significance of NR_MIN_EXITING_PROCESSES. Judging by
the data from the patch below, swap entry freeing can be a bottleneck even
for a single process:

mm: attempt to batch free swap entries for zap_pte_range()
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee

"munmap bandwidth becomes 3X faster."

So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?

> + *
> + * Since the number of exiting processes is determined dynamically, an
> + * exiting process may enter the swp_entry asynchronous release at the
> + * beginning or in the middle of its swp_entry release path.
> + *
> + * Once an exiting process enters the swp_entry asynchronous release, all
> + * remaining swap entries of this exiting process need to be fully released
> + * by the asynchronous kworker, theoretically.

Freeing a slot can indeed release memory from `zRAM`, potentially returning
it to the system for allocation. Your patch frees swap slots asynchronously;
I assume this doesn't slow down the freeing of `zRAM` memory, or could it
actually do so? Then again, freeing compressed memory might not be as
crucial as freeing uncompressed memory with present PTEs.

> + *
> + * The function of the swp_entry asynchronous release:
> + * 1. Alleviate the high system cpu load caused by multiple exiting processes
> + *    running simultaneously.
> + * 2. Reduce lock competition in the swap entry free path by an asynchronous
> + *    kworker instead of multiple exiting processes executing in parallel.
> + * 3. Release pte_present memory occupied by exiting processes more efficiently.
> + */
> +
> +/*
> + * The min number of exiting processes required for swp_entry asynchronous release
> + */
> +#define NR_MIN_EXITING_PROCESSES 2
> +
> +static atomic_t nr_exiting_processes = ATOMIC_INIT(0);
> +static struct kmem_cache *swap_gather_cachep;
> +static struct workqueue_struct *swapfree_wq;
> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled);
> +
> +static int __init tlb_swap_async_free_setup(void)
> +{
> +	swapfree_wq = alloc_workqueue("smfree_wq", WQ_UNBOUND |
> +			WQ_HIGHPRI | WQ_MEM_RECLAIM, 1);
> +	if (!swapfree_wq)
> +		goto fail;
> +
> +	swap_gather_cachep = kmem_cache_create("swap_gather",
> +			sizeof(struct mmu_swap_gather),
> +			0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUNT,
> +			NULL);
> +	if (!swap_gather_cachep)
> +		goto kcache_fail;
> +
> +	static_branch_disable(&tlb_swap_asyncfree_disabled);
> +	return 0;
> +
> +kcache_fail:
> +	destroy_workqueue(swapfree_wq);
> +fail:
> +	return -ENOMEM;
> +}
> +postcore_initcall(tlb_swap_async_free_setup);
> +
> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_gather)
> +{
> +	struct mmu_swap_batch *swap_batch, *next;
> +
> +	for (swap_batch = swap_gather->local.next; swap_batch; swap_batch = next) {
> +		next = swap_batch->next;
> +		free_page((unsigned long)swap_batch);
> +	}
> +	swap_gather->local.next = NULL;
> +	kmem_cache_free(swap_gather_cachep, swap_gather);
> +}
> +
> +static void tlb_swap_async_free_work(struct work_struct *w)
> +{
> +	int i, nr_multi, nr_free;
> +	swp_entry_t start_entry;
> +	struct mmu_swap_batch *swap_batch;
> +	struct mmu_swap_gather *swap_gather = container_of(w,
> +			struct mmu_swap_gather, free_work);
> +
> +	/* Release swap entries cached in mmu_swap_batch. */
> +	for (swap_batch = &swap_gather->local; swap_batch && swap_batch->nr;
> +	     swap_batch = swap_batch->next) {
> +		nr_free = 0;
> +		for (i = 0; i < swap_batch->nr; i++) {
> +			if (unlikely(encoded_swpentry_flags(swap_batch->encoded_entrys[i]) &
> +			    ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) {
> +				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> +				nr_multi = encoded_nr_swpentrys(swap_batch->encoded_entrys[++i]);
> +				free_swap_and_cache_nr(start_entry, nr_multi);
> +				nr_free += 2;
> +			} else {
> +				start_entry = encoded_swpentry_data(swap_batch->encoded_entrys[i]);
> +				free_swap_and_cache_nr(start_entry, 1);
> +				nr_free++;
> +			}
> +		}
> +		swap_batch->nr -= nr_free;
> +		VM_BUG_ON(swap_batch->nr);
> +	}
> +	__tlb_swap_gather_free(swap_gather);
> +}
> +
> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb)
> +{
> +	/*
> +	 * Only the exiting processes with the MM_SWAPENTS counter >=
> +	 * SWAP_CLUSTER_MAX have the opportunity to release their swap
> +	 * entries by asynchronous kworker.
> +	 */
> +	if (!task_is_dying() ||
> +	    get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MAX)
> +		return true;
> +
> +	atomic_inc(&nr_exiting_processes);
> +	if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PROCESSES)
> +		tlb->swp_freeable = 1;
> +	else
> +		tlb->swp_freeing = 1;
> +
> +	return false;
> +}
> +
> +/**
> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structure
> + * for swp_entry tear-down.
> + * @tlb: the mmu_swap_gather structure belongs to tlb
> + */
> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb)
> +{
> +	tlb->swp = kmem_cache_alloc(swap_gather_cachep, GFP_ATOMIC | GFP_NOWAIT);
> +	if (unlikely(!tlb->swp))
> +		return false;
> +
> +	tlb->swp->local.next = NULL;
> +	tlb->swp->local.nr = 0;
> +	tlb->swp->local.max = ARRAY_SIZE(tlb->swp->__encoded_entrys);
> +
> +	tlb->swp->active = &tlb->swp->local;
> +	tlb->swp->batch_count = 0;
> +
> +	INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work);
> +	return true;
> +}
> +
> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb)
> +{
> +	if (static_branch_unlikely(&tlb_swap_asyncfree_disabled))
> +		return;
> +
> +	tlb->swp = NULL;
> +	tlb->swp_freeable = 0;
> +	tlb->swp_freeing = 0;
> +	tlb->swp_disable = 0;
> +
> +	if (__tlb_swap_gather_mmu_check(tlb))
> +		return;
> +
> +	/*
> +	 * If the exiting process meets the conditions of
> +	 * swp_entry asynchronous release, an mmu_swap_gather
> +	 * structure will be initialized.
> +	 */
> +	if (tlb->swp_freeing)
> +		__tlb_swap_gather_init(tlb);
> +}
> +
> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, bool finish)
> +{
> +	queue_work(swapfree_wq, &tlb->swp->free_work);
> +	tlb->swp = NULL;
> +	if (!finish)
> +		__tlb_swap_gather_init(tlb);
> +}
> +
> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb)
> +{
> +	struct mmu_swap_batch *swap_batch;
> +
> +	if (tlb->swp->batch_count == MAX_SWAP_GATHER_BATCH_COUNT)
> +		goto free;
> +
> +	swap_batch = (void *)__get_free_page(GFP_ATOMIC | GFP_NOWAIT);
> +	if (unlikely(!swap_batch))
> +		goto free;
> +
> +	swap_batch->next = NULL;
> +	swap_batch->nr = 0;
> +	swap_batch->max = MAX_SWAP_GATHER_BATCH;
> +
> +	tlb->swp->active->next = swap_batch;
> +	tlb->swp->active = swap_batch;
> +	tlb->swp->batch_count++;
> +	return true;
> +free:
> +	/* batch move to wq */
> +	__tlb_swap_gather_queuework(tlb, false);
> +	return false;
> +}
> +
> +/**
> + * __tlb_remove_swap_entries - the swap entries in exiting process are
> + * isolated, batch cached in struct mmu_swap_batch.
> + * @tlb: the current mmu_gather
> + * @entry: swp_entry to be isolated and cached
> + * @nr: the number of consecutive entries starting from entry parameter.
> + */
> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> +		swp_entry_t entry, int nr)
> +{
> +	struct mmu_swap_batch *swap_batch;
> +	unsigned long flags = 0;
> +	bool ret = false;
> +
> +	if (tlb->swp_disable)
> +		return ret;
> +
> +	if (!tlb->swp_freeable && !tlb->swp_freeing)
> +		return ret;
> +
> +	if (tlb->swp_freeable) {
> +		if (atomic_read(&nr_exiting_processes) <
> +		    NR_MIN_EXITING_PROCESSES)
> +			return ret;
> +		/*
> +		 * If the current number of exiting processes
> +		 * is >= NR_MIN_EXITING_PROCESSES, the exiting
> +		 * process with swp_freeable state will enter
> +		 * swp_freeing state to start releasing its
> +		 * remaining swap entries by the asynchronous
> +		 * kworker.
> +		 */
> +		tlb->swp_freeable = 0;
> +		tlb->swp_freeing = 1;
> +	}
> +
> +	VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing);
> +	if (!tlb->swp && !__tlb_swap_gather_init(tlb))
> +		return ret;
> +
> +	swap_batch = tlb->swp->active;
> +	if (unlikely(swap_batch->nr >= swap_batch->max - 1)) {
> +		__tlb_swap_gather_queuework(tlb, false);
> +		return ret;
> +	}
> +
> +	if (likely(nr == 1)) {
> +		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> +	} else {
> +		flags |= ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT;
> +		swap_batch->encoded_entrys[swap_batch->nr++] = encode_swpentry(entry, flags);
> +		swap_batch->encoded_entrys[swap_batch->nr++] = encode_nr_swpentrys(nr);
> +	}
> +	ret = true;
> +
> +	if (swap_batch->nr >= swap_batch->max - 1) {
> +		if (!__tlb_swap_next_batch(tlb))
> +			goto exit;
> +		swap_batch = tlb->swp->active;
> +	}
> +	VM_BUG_ON(swap_batch->nr > swap_batch->max - 1);
> +exit:
> +	return ret;
> +}
> +
> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb)
> +{
> +	if (tlb->swp_disable)
> +		return;
> +
> +	if (!tlb->swp_freeable && !tlb->swp_freeing)
> +		return;
> +
> +	if (tlb->swp_freeable) {
> +		tlb->swp_freeable = 0;
> +		VM_BUG_ON(tlb->swp_freeing);
> +		goto exit;
> +	}
> +	tlb->swp_freeing = 0;
> +	if (unlikely(!tlb->swp))
> +		goto exit;
> +
> +	__tlb_swap_gather_queuework(tlb, true);
> +exit:
> +	atomic_dec(&nr_exiting_processes);
> +}
>
>  static bool tlb_next_batch(struct mmu_gather *tlb)
>  {
> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
>  	tlb->local.max = ARRAY_SIZE(tlb->__pages);
>  	tlb->active = &tlb->local;
>  	tlb->batch_count = 0;
> +
> +	tlb->swp_disable = 1;
> +	__tlb_swap_gather_mmu(tlb);
>  #endif
>  	tlb->delayed_rmap = 0;
>
> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
>  	tlb_batch_list_free(tlb);
> +	__tlb_batch_swap_finish(tlb);
>  #endif
>  	dec_tlb_flush_pending(tlb->mm);
>  }
> --
> 2.39.0
>

Thanks
Barry
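
For readers skimming the thread, below is a minimal userspace sketch of the
entry-encoding convention the patch relies on: bit 63 of an encoded value
marks "the next array slot holds the run length", otherwise the run length
is implicitly 1, which loosely mirrors how tlb_swap_async_free_work() walks
a batch. The names and values here are illustrative only, not the kernel
implementation.

    /* encoding_sketch.c - illustrative only; compile with any C99 compiler */
    #include <stdio.h>
    #include <stdint.h>

    /* analogous to ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT in the patch */
    #define NR_NEXT_FLAG (1ULL << 63)

    int main(void)
    {
            uint64_t batch[4];
            unsigned int n = 0;

            /* a single swap entry: stored as-is, run length implicitly 1 */
            batch[n++] = 0x1234;

            /*
             * a run of 8 consecutive entries starting at 0x5000:
             * flag the first slot, put the count in the next slot
             */
            batch[n++] = 0x5000 | NR_NEXT_FLAG;
            batch[n++] = 8;

            /* the async worker would walk the array roughly like this */
            for (unsigned int i = 0; i < n; i++) {
                    uint64_t start = batch[i] & ~NR_NEXT_FLAG;
                    uint64_t nr = 1;

                    if (batch[i] & NR_NEXT_FLAG)
                            nr = batch[++i];
                    printf("free %llu entr%s starting at 0x%llx\n",
                           (unsigned long long)nr, nr == 1 ? "y" : "ies",
                           (unsigned long long)start);
            }
            return 0;
    }

The two-slot scheme trades one extra array slot per multi-entry run for the
ability to hand a whole run to free_swap_and_cache_nr() in a single call.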