From: Barry Song <21cnbao@gmail.com>
Date: Tue, 10 Sep 2024 22:11:24 +1200
Subject: Re: [PATCH v3 2/2] mm: tlb: add tlb swap entries batch async release
To: zhiguojiang
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Will Deacon, "Aneesh Kumar K.V", Nick Piggin, Peter Zijlstra,
 Arnd Bergmann, Johannes Weiner, Michal Hocko, Roman Gushchin,
 Shakeel Butt, Muchun Song, linux-arch@vger.kernel.org,
 cgroups@vger.kernel.org, David Hildenbrand, opensource.kernel@vivo.com
In-Reply-To: <90e13734-526f-44fb-8a5c-8e8199a5aef3@vivo.com>
References: <20240805153639.1057-1-justinjiang@vivo.com>
 <20240805153639.1057-3-justinjiang@vivo.com>
 <400918d7-aaaf-4ccc-af8e-ab48576746d1@vivo.com>
 <1e7e24a7-e602-4654-ba3f-d3e4d1a2a65e@vivo.com>
 <90e13734-526f-44fb-8a5c-8e8199a5aef3@vivo.com>

On Tue, Sep 10, 2024 at 9:22 PM zhiguojiang wrote:
>
>
>
> On 2024/9/10 12:18, Barry Song wrote:
> > On Tue, Sep 10, 2024 at 2:39 AM zhiguojiang wrote:
> >>
> >>
> >> On 2024/9/9 9:59, Barry Song wrote:
> >>> On Wed, Sep 4, 2024 at 11:26 PM zhiguojiang wrote:
> >>>>
> >>>> On 2024/9/4 17:16, Barry Song wrote:
> >>>>> On Tue, Aug 6, 2024 at 3:36 AM Zhiguo Jiang wrote:
> >>>>>> One of the main reasons for the prolonged exit of a process with an
> >>>>>> independent mm is the time-consuming release of its swap entries.
> >>>>>> The proportion of swap memory occupied by the process increases over
> >>>>>> time, because high memory pressure triggers reclaim of anonymous
> >>>>>> folios into swap space; e.g., on Android devices we found this
> >>>>>> proportion can reach 60% or more after a period of time. Additionally,
> >>>>>> the relatively lengthy path for releasing swap entries further adds
> >>>>>> to the time required to release them.
> >>>>>>
> >>>>>> Testing Platform: 8GB RAM
> >>>>>> Testing procedure:
> >>>>>> After booting up, start 15 processes first, and then observe the
> >>>>>> physical memory size occupied by the last launched process at
> >>>>>> different time points.
> >>>>>> Example: The process launched last: com.qiyi.video
> >>>>>> | memory type   | 0min   | 1min   | 5min   | 10min  | 15min  |
> >>>>>> -------------------------------------------------------------
> >>>>>> | VmRSS(KB)     | 453832 | 252300 | 204364 | 199944 | 199748 |
> >>>>>> | RssAnon(KB)   | 247348 | 99296  | 71268  | 67808  | 67660  |
> >>>>>> | RssFile(KB)   | 205536 | 152020 | 132144 | 131184 | 131136 |
> >>>>>> | RssShmem(KB)  | 1048   | 984    | 952    | 952    | 952    |
> >>>>>> | VmSwap(KB)    | 202692 | 334852 | 362880 | 366340 | 366488 |
> >>>>>> | Swap ratio(%) | 30.87% | 57.03% | 63.97% | 64.69% | 64.72% |
> >>>>>> Note: min - minute.
> >>>>>>
> >>>>>> When there are multiple processes with independent mm and high memory
> >>>>>> pressure in the system, launching a process that requires a large
> >>>>>> amount of memory is likely to trigger the near-instantaneous killing
> >>>>>> of many processes with independent mm. The multiple exiting processes
> >>>>>> then occupy multiple CPU cores for concurrent execution, which causes
> >>>>>> issues such as lag in the important, non-exiting processes.
> >>>>>>
> >>>>>> To solve this problem, we have introduced an asynchronous release
> >>>>>> mechanism for the swap entries of multiple exiting processes: the
> >>>>>> swap entries occupied by the exiting processes are isolated, cached,
> >>>>>> and handed over to an asynchronous kworker to complete the release.
> >>>>>> This allows the exiting processes to complete quickly and release
> >>>>>> CPU resources.
> >>>>>> We have validated this modification on Android products and achieved
> >>>>>> the expected benefits.
> >>>>>>
> >>>>>> Testing Platform: 8GB RAM
> >>>>>> Testing procedure:
> >>>>>> After restarting the machine, start 15 app processes first, and then
> >>>>>> start the camera app process; we monitor the cold start and preview
> >>>>>> time data of the camera app process.
> >>>>>>
> >>>>>> Test data of camera process cold start time (unit: millisecond):
> >>>>>> | seq    | 1    | 2    | 3    | 4    | 5    | 6    | average |
> >>>>>> | before | 1498 | 1476 | 1741 | 1337 | 1367 | 1655 | 1512    |
> >>>>>> | after  | 1396 | 1107 | 1136 | 1178 | 1071 | 1339 | 1204    |
> >>>>>>
> >>>>>> Test data of camera process preview time (unit: millisecond):
> >>>>>> | seq    | 1   | 2   | 3   | 4   | 5   | 6   | average |
> >>>>>> | before | 267 | 402 | 504 | 513 | 161 | 265 | 352     |
> >>>>>> | after  | 188 | 223 | 301 | 203 | 162 | 154 | 205     |
> >>>>>>
> >>>>>> Based on the averages of the six sets of test data above, the modified
> >>>>>> patch shows the following benefits:
> >>>>>> 1. The cold start time of the camera app process is reduced by about 20%.
> >>>>>> 2. The preview time of the camera app process is reduced by about 42%.
> >>>>>>
> >>>>>> It offers several benefits:
> >>>>>> 1. Alleviate the high system CPU load caused by multiple exiting
> >>>>>>    processes running simultaneously.
> >>>>>> 2. Reduce lock contention in the swap entry free path by using an
> >>>>>>    asynchronous kworker instead of multiple exiting processes running
> >>>>>>    in parallel.
> >>>>>> 3. Release pte_present memory occupied by exiting processes more
> >>>>>>    efficiently.
> >>>>>>
> >>>>>> Signed-off-by: Zhiguo Jiang
> >>>>>> ---
> >>>>>>  arch/s390/include/asm/tlb.h |   8 +
> >>>>>>  include/asm-generic/tlb.h   |  44 ++++++
> >>>>>>  include/linux/mm_types.h    |  58 +++++++
> >>>>>>  mm/memory.c                 |   3 +-
> >>>>>>  mm/mmu_gather.c             | 296 ++++++++++++++++++++++++++++++++
> >>>>>>  5 files changed, 408 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> >>>>>> index e95b2c8081eb..3f681f63390f
> >>>>>> --- a/arch/s390/include/asm/tlb.h
> >>>>>> +++ b/arch/s390/include/asm/tlb.h
> >>>>>> @@ -28,6 +28,8 @@ static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
> >>>>>>                 struct page *page, bool delay_rmap, int page_size);
> >>>>>>  static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>>>                 struct page *page, unsigned int nr_pages, bool delay_rmap);
> >>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>>>> +               swp_entry_t entry, int nr);
> >>>>>>
> >>>>>>  #define tlb_flush tlb_flush
> >>>>>>  #define pte_free_tlb pte_free_tlb
> >>>>>> @@ -69,6 +71,12 @@ static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
> >>>>>>         return false;
> >>>>>>  }
> >>>>>>
> >>>>>> +static inline bool __tlb_remove_swap_entries(struct mmu_gather *tlb,
> >>>>>> +               swp_entry_t entry, int nr)
> >>>>>> +{
> >>>>>> +       return false;
> >>>>>> +}
> >>>>>> +
> >>>>>>  static inline void tlb_flush(struct mmu_gather *tlb)
> >>>>>>  {
> >>>>>>         __tlb_flush_mm_lazy(tlb->mm);
> >>>>>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> >>>>>> index 709830274b75..8b4d516b35b8
> >>>>>> --- a/include/asm-generic/tlb.h
> >>>>>> +++ b/include/asm-generic/tlb.h
> >>>>>> @@ -294,6 +294,37 @@ extern void tlb_flush_rmaps(struct mmu_gather *tlb, struct vm_area_struct *vma);
> >>>>>>  static inline void tlb_flush_rmaps(struct
mmu_gather *tlb, str= uct vm_area_struct *vma) { } > >>>>>> #endif > >>>>>> > >>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER > >>>>>> +struct mmu_swap_batch { > >>>>>> + struct mmu_swap_batch *next; > >>>>>> + unsigned int nr; > >>>>>> + unsigned int max; > >>>>>> + encoded_swpentry_t encoded_entrys[]; > >>>>>> +}; > >>>>>> + > >>>>>> +#define MAX_SWAP_GATHER_BATCH \ > >>>>>> + ((PAGE_SIZE - sizeof(struct mmu_swap_batch)) / sizeof(void= *)) > >>>>>> + > >>>>>> +#define MAX_SWAP_GATHER_BATCH_COUNT (10000UL / MAX_SWAP_GATHER= _BATCH) > >>>>>> + > >>>>>> +struct mmu_swap_gather { > >>>>>> + /* > >>>>>> + * the asynchronous kworker to batch > >>>>>> + * release swap entries > >>>>>> + */ > >>>>>> + struct work_struct free_work; > >>>>>> + > >>>>>> + /* batch cache swap entries */ > >>>>>> + unsigned int batch_count; > >>>>>> + struct mmu_swap_batch *active; > >>>>>> + struct mmu_swap_batch local; > >>>>>> + encoded_swpentry_t __encoded_entrys[MMU_GATHER_BUNDLE]; > >>>>>> +}; > >>>>>> + > >>>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb, > >>>>>> + swp_entry_t entry, int nr); > >>>>>> +#endif > >>>>>> + > >>>>>> /* > >>>>>> * struct mmu_gather is an opaque type used by the mm code for= passing around > >>>>>> * any data needed by arch specific code for tlb_remove_page. > >>>>>> @@ -343,6 +374,18 @@ struct mmu_gather { > >>>>>> unsigned int vma_exec : 1; > >>>>>> unsigned int vma_huge : 1; > >>>>>> unsigned int vma_pfn : 1; > >>>>>> +#ifndef CONFIG_MMU_GATHER_NO_GATHER > >>>>>> + /* > >>>>>> + * Two states of releasing swap entries > >>>>>> + * asynchronously: > >>>>>> + * swp_freeable - have opportunity to > >>>>>> + * release asynchronously future > >>>>>> + * swp_freeing - be releasing asynchronously. > >>>>>> + */ > >>>>>> + unsigned int swp_freeable : 1; > >>>>>> + unsigned int swp_freeing : 1; > >>>>>> + unsigned int swp_disable : 1; > >>>>>> +#endif > >>>>>> > >>>>>> unsigned int batch_count; > >>>>>> > >>>>>> @@ -354,6 +397,7 @@ struct mmu_gather { > >>>>>> #ifdef CONFIG_MMU_GATHER_PAGE_SIZE > >>>>>> unsigned int page_size; > >>>>>> #endif > >>>>>> + struct mmu_swap_gather *swp; > >>>>>> #endif > >>>>>> }; > >>>>>> > >>>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >>>>>> index 165c58b12ccc..2f66303f1519 > >>>>>> --- a/include/linux/mm_types.h > >>>>>> +++ b/include/linux/mm_types.h > >>>>>> @@ -283,6 +283,64 @@ typedef struct { > >>>>>> unsigned long val; > >>>>>> } swp_entry_t; > >>>>>> > >>>>>> +/* > >>>>>> + * encoded_swpentry_t - a type marking the encoded swp_entry_t. > >>>>>> + * > >>>>>> + * An 'encoded_swpentry_t' represents a 'swp_enrty_t' with its th= e highest > >>>>>> + * bit indicating extra context-dependent information. Only used = in swp_entry > >>>>>> + * asynchronous release path by mmu_swap_gather. > >>>>>> + */ > >>>>>> +typedef struct { > >>>>>> + unsigned long val; > >>>>>> +} encoded_swpentry_t; > >>>>>> + > >>>>>> +/* > >>>>>> + * The next item in an encoded_swpentry_t array is the "nr" argum= ent, specifying the > >>>>>> + * total number of consecutive swap entries associated with the s= ame folio. If this > >>>>>> + * bit is not set, "nr" is implicitly 1. > >>>>>> + * > >>>>>> + * Refer to include\asm\pgtable.h, swp_offset bits: 0 ~ 57, swp_t= ype bits: 58 ~ 62. > >>>>>> + * Bit63 can be used here. 
> >>>>>> + */ > >>>>>> +#define ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT (1UL << (BITS_PER_LON= G - 1)) > >>>>>> + > >>>>>> +static __always_inline encoded_swpentry_t > >>>>>> +encode_swpentry(swp_entry_t entry, unsigned long flags) > >>>>>> +{ > >>>>>> + encoded_swpentry_t ret; > >>>>>> + > >>>>>> + VM_WARN_ON_ONCE(flags & ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NE= XT); > >>>>>> + ret.val =3D flags | entry.val; > >>>>>> + return ret; > >>>>>> +} > >>>>>> + > >>>>>> +static inline unsigned long encoded_swpentry_flags(encoded_swpent= ry_t entry) > >>>>>> +{ > >>>>>> + return ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.val; > >>>>>> +} > >>>>>> + > >>>>>> +static inline swp_entry_t encoded_swpentry_data(encoded_swpentry_= t entry) > >>>>>> +{ > >>>>>> + swp_entry_t ret; > >>>>>> + > >>>>>> + ret.val =3D ~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT & entry.v= al; > >>>>>> + return ret; > >>>>>> +} > >>>>>> + > >>>>>> +static __always_inline encoded_swpentry_t encode_nr_swpentrys(uns= igned long nr) > >>>>>> +{ > >>>>>> + encoded_swpentry_t ret; > >>>>>> + > >>>>>> + VM_WARN_ON_ONCE(nr & ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT); > >>>>>> + ret.val =3D nr; > >>>>>> + return ret; > >>>>>> +} > >>>>>> + > >>>>>> +static __always_inline unsigned long encoded_nr_swpentrys(encoded= _swpentry_t entry) > >>>>>> +{ > >>>>>> + return ((~ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT) & entry.val= ); > >>>>>> +} > >>>>>> + > >>>>>> /** > >>>>>> * struct folio - Represents a contiguous set of bytes. > >>>>>> * @flags: Identical to the page flags. > >>>>>> diff --git a/mm/memory.c b/mm/memory.c > >>>>>> index d6a9dcddaca4..023a8adcb67c > >>>>>> --- a/mm/memory.c > >>>>>> +++ b/mm/memory.c > >>>>>> @@ -1650,7 +1650,8 @@ static unsigned long zap_pte_range(struct mm= u_gather *tlb, > >>>>>> if (!should_zap_cows(details)) > >>>>>> continue; > >>>>>> rss[MM_SWAPENTS] -=3D nr; > >>>>>> - free_swap_and_cache_nr(entry, nr); > >>>>>> + if (!__tlb_remove_swap_entries(tlb, entry,= nr)) > >>>>>> + free_swap_and_cache_nr(entry, nr); > >>>>>> } else if (is_migration_entry(entry)) { > >>>>>> folio =3D pfn_swap_entry_folio(entry); > >>>>>> if (!should_zap_folio(details, folio)) > >>>>>> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c > >>>>>> index 99b3e9408aa0..33dc9d1faff9 > >>>>>> --- a/mm/mmu_gather.c > >>>>>> +++ b/mm/mmu_gather.c > >>>>>> @@ -9,11 +9,303 @@ > >>>>>> #include > >>>>>> #include > >>>>>> #include > >>>>>> +#include > >>>>>> > >>>>>> #include > >>>>>> #include > >>>>>> > >>>>>> #ifndef CONFIG_MMU_GATHER_NO_GATHER > >>>>>> +/* > >>>>>> + * The swp_entry asynchronous release mechanism for multiple proc= esses with > >>>>>> + * independent mm exiting simultaneously. > >>>>>> + * > >>>>>> + * During the multiple exiting processes releasing their own mm s= imultaneously, > >>>>>> + * the swap entries in the exiting processes are handled by isola= ting, caching > >>>>>> + * and handing over to an asynchronous kworker to complete the re= lease. > >>>>>> + * > >>>>>> + * The conditions for the exiting process entering the swp_entry = asynchronous > >>>>>> + * release path: > >>>>>> + * 1. The exiting process's MM_SWAPENTS count is >=3D SWAP_CLUSTE= R_MAX, avoiding > >>>>>> + * to alloc struct mmu_swap_gather frequently. > >>>>>> + * 2. The number of exiting processes is >=3D NR_MIN_EXITING_PROC= ESSES. > >>>>> Hi Zhiguo, > >>>>> > >>>>> I'm curious about the significance of NR_MIN_EXITING_PROCESSES. 
> >>>>> It seems that batched swap entry freeing, even with one process, could
> >>>>> be a bottleneck for a single process, based on the data from this patch:
> >>>>>
> >>>>> mm: attempt to batch free swap entries for zap_pte_range()
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-stable&id=bea67dcc5ee
> >>>>> "munmap bandwidth becomes 3X faster."
> >>>>>
> >>>>> So what would happen if you simply set NR_MIN_EXITING_PROCESSES to 1?
> >>>> Hi Barry,
> >>>>
> >>>> Thanks for your comments.
> >>>>
> >>>> The reason for NR_MIN_EXITING_PROCESSES = 2 is that we previously ran
> >>>> the multiple-Android-apps continuous startup performance test with
> >>>> NR_MIN_EXITING_PROCESSES = 1, and the results showed that the startup
> >>>> time deteriorated slightly. However, the patch's logic in that test was
> >>>> different from the one I submitted to the community, so it may have been
> >>>> due to other issues with the previous old patch.
> >>>>
> >>>> However, we have run the relevant memory performance tests on the
> >>>> patches submitted to the community (NR_MIN_EXITING_PROCESSES=2), and
> >>>> the results are better than before the modification. The patches have
> >>>> been used on multiple projects.
> >>>> For example:
> >>>> Test the time consumption and subjective fluency experience of
> >>>> launching 30 Android apps continuously for two rounds.
> >>>> Test machine: RAM 16GB
> >>>> |        | time(s) | Frame-drop rate(%) |
> >>>> | before | 230.76  | 0.54               |
> >>>> | after  | 225.23  | 0.74               |
> >>>> We can see that the patch improves the startup time by 5.53 s, at the
> >>>> cost of a 0.2% higher frame-drop rate, with a better subjective
> >>>> smoothness experience.
> >>>>
> >>>> Perhaps the patches submitted to the community also improve the
> >>>> multiple-Android-apps continuous startup performance in the case of
> >>>> NR_MIN_EXITING_PROCESSES=1. If necessary, I will run the relevant tests
> >>>> to verify this in the future.
> >> Thanks Barry for your valuable suggestions.
> >>> Using a fixed value like 2 feels more like a workaround than a solid
> >>> solution. It would be better if we could eliminate this hack.
> >> Ok, I will conduct more tests to try to solve this parameter issue.
> >>> Additionally, this type of asynchronous reclamation might struggle to
> >>> scale effectively, particularly on NUMA systems with many CPU cores.
> >>>
> >>> Many kernel threads are per-node, like kswapd. For instance, if we have
> >>> 100 threads running on 100 CPUs executing zap_pte_range(), your approach,
> >>> which relies on a single async thread to reclaim swap entries, might
> >>> lead to performance regressions.
> >>>
> >>> We might need to consider a more adaptable approach that can evaluate
> >>> the machine's topology and dynamically determine the appropriate number
> >>> of async threads, rather than hard-coding it to just one. Otherwise,
> >>> there could be ongoing concerns about whether this solution is truly
> >>> applicable to all systems.
> >> Can we dynamically determine the number of asynchronous kworkers to be
> >> created based on the number of exiting processes at a certain moment,
> >> for example, one asynchronous kworker for every 8 exiting processes?
> >>
> >> | The number of exiting processes | The number of asynchronous kworkers |
> >> | [1, 8]                          | 1                                   |
> >> | [9, 16]                         | 2                                   |
> >> | ...                             | ...                                 |
> >> | [8*N+1, 8*(N+1)]                | N+1                                 |
> >> N >= 0
> > I'm not entirely sure. Another potential approach could be to use a
> > dedicated thread for each NUMA node, but we would need data to confirm
> > if this is the right solution.
> I feel that this may be a feasible async approach, where the async kworker
> of each NUMA node is responsible for maintaining and releasing the swap
> entries mapped by the exiting processes executing on the corresponding
> local CPUs. I am not sure how many CPUs each NUMA node can contain, and I
> currently do not have a NUMA environment for this testing verification.

Yes, I get it. As someone working on phones, we don't have NUMA machines,
but our code still impacts them. That's what makes things so tricky :-)

> >>> Alternatively, we might be able to develop a method to speed up batched
> >>> freeing in a synchronous manner after collecting the mmu_swap_batch.
> >>> mmu_gather isn't async, but it can still speed up tlb flush, right?
> >> The synchronous manner may have some optimization effect, but it seems
> >> that the effect on the CPU load occupied by multiple exiting processes
> >> is not significant. In addition, David stated in the latest comment that
> >> swap entries seem unrelated to the tlb.
> >>> For phones with just 8 CPU cores, I definitely like your patch. However,
> >>> since we're aiming for something which can affect all systems, the
> >>> situation might be more complex.
> >> Yes, this patch still needs further improvement to adapt to all systems.
> > If we want to keep things simple, another approach might be to profile the
> > hotspots within swap_free_nr(), identifying the most time-consuming parts,
> > and then optimize them based on the data we collect from your
> > mmu_swap_batch. Then we don't have to make things async. Have you ever
> > profiled the hotspots?
> The hotspots seem to have been addressed by __swap_entries_free() in the
> patches you submitted, which releases consecutive swap entries at once.
> Moreover, there may be duplication between the mmu_swap_batch and the
> swap_slots_cache. 🙂

I assume you're referring to how __swap_entries_free() in the patch
"mm: attempt to batch free swap entries for zap_pte_range()" addresses the
hotspots, potentially eliminating the swap entry release as a bottleneck
once we have mTHP. However, for platforms that don't use mTHP, is this
still a significant issue? Have you ever profiled the code and identified
the exact bottleneck? If so, we can brainstorm ideas to address it
together. Honestly, I wasn't sure which specific line was causing
swap_free to be so slow, but __swap_entries_free() significantly improves
performance for the mTHP case.

> >
> >>>>>> + *
> >>>>>> + * Since the time for determining the number of exiting processes is dynamic,
> >>>>>> + * the exiting process may start to enter the swp_entry asynchronous release
> >>>>>> + * at the beginning or middle stage of the exiting process's swp_entry release
> >>>>>> + * path.
> >>>>>> + *
> >>>>>> + * Once an exiting process enters the swp_entry asynchronous release, all remaining
> >>>>>> + * swap entries in this exiting process need to be fully released by asynchronous
> >>>>>> + * kworker theoretically.
> >>>>> Freeing a slot can indeed release memory from `zRAM`, potentially
> >>>>> returning it to the system for allocation.
Your patch frees swap slots asynch= ronously; > >>>>> I assume this doesn=E2=80=99t slow down the memory freeing process = for `zRAM`, or > >>>>> could it even slow down the freeing of `zRAM` memory? Freeing compr= essed > >>>>> memory might not be as crucial compared to freeing uncompressed mem= ory with > >>>>> present PTEs? > >>>> Yes, freeing uncompressed memory with present PTEs is more important > >>>> compared to freeing compressed 'zRAM' memory. > >>>> > >>>> I guess that the multiple exiting processes releasing swap entries > >>>> simultaneously may result in the swap_info->lock competition pressur= e > >>>> in swapcache_free_entries(), affecting the efficiency of releasing s= wap > >>>> entries. However, if the asynchronous kworker is used, this issue ca= n > >>>> be avoided, and perhaps the improvement is minor. > >>>> > >>>> The freeing of zRAM memory does not slow down. We have observed trac= es > >>>> in the camera startup scene and found that the asynchronous kworker > >>>> can release all swap entries before entering the camera preview. > >>>> Compared to not using the asynchronous kworker, the exiting processe= s > >>>> completed after entering the camera preview. > >>>>>> + * > >>>>>> + * The function of the swp_entry asynchronous release: > >>>>>> + * 1. Alleviate the high system cpu load caused by multiple exiti= ng processes > >>>>>> + * running simultaneously. > >>>>>> + * 2. Reduce lock competition in swap entry free path by an async= hronous kworker > >>>>>> + * instead of multiple exiting processes parallel execution. > >>>>>> + * 3. Release pte_present memory occupied by exiting processes mo= re efficiently. > >>>>>> + */ > >>>>>> + > >>>>>> +/* > >>>>>> + * The min number of exiting processes required for swp_entry asy= nchronous release > >>>>>> + */ > >>>>>> +#define NR_MIN_EXITING_PROCESSES 2 > >>>>>> + > >>>>>> +static atomic_t nr_exiting_processes =3D ATOMIC_INIT(0); > >>>>>> +static struct kmem_cache *swap_gather_cachep; > >>>>>> +static struct workqueue_struct *swapfree_wq; > >>>>>> +static DEFINE_STATIC_KEY_TRUE(tlb_swap_asyncfree_disabled); > >>>>>> + > >>>>>> +static int __init tlb_swap_async_free_setup(void) > >>>>>> +{ > >>>>>> + swapfree_wq =3D alloc_workqueue("smfree_wq", WQ_UNBOUND | > >>>>>> + WQ_HIGHPRI | WQ_MEM_RECLAIM, 1); > >>>>>> + if (!swapfree_wq) > >>>>>> + goto fail; > >>>>>> + > >>>>>> + swap_gather_cachep =3D kmem_cache_create("swap_gather", > >>>>>> + sizeof(struct mmu_swap_gather), > >>>>>> + 0, SLAB_TYPESAFE_BY_RCU | SLAB_PANIC | SLAB_ACCOUN= T, > >>>>>> + NULL); > >>>>>> + if (!swap_gather_cachep) > >>>>>> + goto kcache_fail; > >>>>>> + > >>>>>> + static_branch_disable(&tlb_swap_asyncfree_disabled); > >>>>>> + return 0; > >>>>>> + > >>>>>> +kcache_fail: > >>>>>> + destroy_workqueue(swapfree_wq); > >>>>>> +fail: > >>>>>> + return -ENOMEM; > >>>>>> +} > >>>>>> +postcore_initcall(tlb_swap_async_free_setup); > >>>>>> + > >>>>>> +static void __tlb_swap_gather_free(struct mmu_swap_gather *swap_g= ather) > >>>>>> +{ > >>>>>> + struct mmu_swap_batch *swap_batch, *next; > >>>>>> + > >>>>>> + for (swap_batch =3D swap_gather->local.next; swap_batch; s= wap_batch =3D next) { > >>>>>> + next =3D swap_batch->next; > >>>>>> + free_page((unsigned long)swap_batch); > >>>>>> + } > >>>>>> + swap_gather->local.next =3D NULL; > >>>>>> + kmem_cache_free(swap_gather_cachep, swap_gather); > >>>>>> +} > >>>>>> + > >>>>>> +static void tlb_swap_async_free_work(struct work_struct *w) > >>>>>> +{ > >>>>>> + int i, nr_multi, nr_free; > >>>>>> + swp_entry_t 
start_entry; > >>>>>> + struct mmu_swap_batch *swap_batch; > >>>>>> + struct mmu_swap_gather *swap_gather =3D container_of(w, > >>>>>> + struct mmu_swap_gather, free_work); > >>>>>> + > >>>>>> + /* Release swap entries cached in mmu_swap_batch. */ > >>>>>> + for (swap_batch =3D &swap_gather->local; swap_batch && swa= p_batch->nr; > >>>>>> + swap_batch =3D swap_batch->next) { > >>>>>> + nr_free =3D 0; > >>>>>> + for (i =3D 0; i < swap_batch->nr; i++) { > >>>>>> + if (unlikely(encoded_swpentry_flags(swap_b= atch->encoded_entrys[i]) & > >>>>>> + ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT)) = { > >>>>>> + start_entry =3D encoded_swpentry_d= ata(swap_batch->encoded_entrys[i]); > >>>>>> + nr_multi =3D encoded_nr_swpentrys(= swap_batch->encoded_entrys[++i]); > >>>>>> + free_swap_and_cache_nr(start_entry= , nr_multi); > >>>>>> + nr_free +=3D 2; > >>>>>> + } else { > >>>>>> + start_entry =3D encoded_swpentry_d= ata(swap_batch->encoded_entrys[i]); > >>>>>> + free_swap_and_cache_nr(start_entry= , 1); > >>>>>> + nr_free++; > >>>>>> + } > >>>>>> + } > >>>>>> + swap_batch->nr -=3D nr_free; > >>>>>> + VM_BUG_ON(swap_batch->nr); > >>>>>> + } > >>>>>> + __tlb_swap_gather_free(swap_gather); > >>>>>> +} > >>>>>> + > >>>>>> +static bool __tlb_swap_gather_mmu_check(struct mmu_gather *tlb) > >>>>>> +{ > >>>>>> + /* > >>>>>> + * Only the exiting processes with the MM_SWAPENTS counter= >=3D > >>>>>> + * SWAP_CLUSTER_MAX have the opportunity to release their = swap > >>>>>> + * entries by asynchronous kworker. > >>>>>> + */ > >>>>>> + if (!task_is_dying() || > >>>>>> + get_mm_counter(tlb->mm, MM_SWAPENTS) < SWAP_CLUSTER_MA= X) > >>>>>> + return true; > >>>>>> + > >>>>>> + atomic_inc(&nr_exiting_processes); > >>>>>> + if (atomic_read(&nr_exiting_processes) < NR_MIN_EXITING_PR= OCESSES) > >>>>>> + tlb->swp_freeable =3D 1; > >>>>>> + else > >>>>>> + tlb->swp_freeing =3D 1; > >>>>>> + > >>>>>> + return false; > >>>>>> +} > >>>>>> + > >>>>>> +/** > >>>>>> + * __tlb_swap_gather_init - Initialize an mmu_swap_gather structu= re > >>>>>> + * for swp_entry tear-down. > >>>>>> + * @tlb: the mmu_swap_gather structure belongs to tlb > >>>>>> + */ > >>>>>> +static bool __tlb_swap_gather_init(struct mmu_gather *tlb) > >>>>>> +{ > >>>>>> + tlb->swp =3D kmem_cache_alloc(swap_gather_cachep, GFP_ATOM= IC | GFP_NOWAIT); > >>>>>> + if (unlikely(!tlb->swp)) > >>>>>> + return false; > >>>>>> + > >>>>>> + tlb->swp->local.next =3D NULL; > >>>>>> + tlb->swp->local.nr =3D 0; > >>>>>> + tlb->swp->local.max =3D ARRAY_SIZE(tlb->swp->__encoded_e= ntrys); > >>>>>> + > >>>>>> + tlb->swp->active =3D &tlb->swp->local; > >>>>>> + tlb->swp->batch_count =3D 0; > >>>>>> + > >>>>>> + INIT_WORK(&tlb->swp->free_work, tlb_swap_async_free_work); > >>>>>> + return true; > >>>>>> +} > >>>>>> + > >>>>>> +static void __tlb_swap_gather_mmu(struct mmu_gather *tlb) > >>>>>> +{ > >>>>>> + if (static_branch_unlikely(&tlb_swap_asyncfree_disabled)) > >>>>>> + return; > >>>>>> + > >>>>>> + tlb->swp =3D NULL; > >>>>>> + tlb->swp_freeable =3D 0; > >>>>>> + tlb->swp_freeing =3D 0; > >>>>>> + tlb->swp_disable =3D 0; > >>>>>> + > >>>>>> + if (__tlb_swap_gather_mmu_check(tlb)) > >>>>>> + return; > >>>>>> + > >>>>>> + /* > >>>>>> + * If the exiting process meets the conditions of > >>>>>> + * swp_entry asynchronous release, an mmu_swap_gather > >>>>>> + * structure will be initialized. 
> >>>>>> + */ > >>>>>> + if (tlb->swp_freeing) > >>>>>> + __tlb_swap_gather_init(tlb); > >>>>>> +} > >>>>>> + > >>>>>> +static void __tlb_swap_gather_queuework(struct mmu_gather *tlb, b= ool finish) > >>>>>> +{ > >>>>>> + queue_work(swapfree_wq, &tlb->swp->free_work); > >>>>>> + tlb->swp =3D NULL; > >>>>>> + if (!finish) > >>>>>> + __tlb_swap_gather_init(tlb); > >>>>>> +} > >>>>>> + > >>>>>> +static bool __tlb_swap_next_batch(struct mmu_gather *tlb) > >>>>>> +{ > >>>>>> + struct mmu_swap_batch *swap_batch; > >>>>>> + > >>>>>> + if (tlb->swp->batch_count =3D=3D MAX_SWAP_GATHER_BATCH_COU= NT) > >>>>>> + goto free; > >>>>>> + > >>>>>> + swap_batch =3D (void *)__get_free_page(GFP_ATOMIC | GFP_NO= WAIT); > >>>>>> + if (unlikely(!swap_batch)) > >>>>>> + goto free; > >>>>>> + > >>>>>> + swap_batch->next =3D NULL; > >>>>>> + swap_batch->nr =3D 0; > >>>>>> + swap_batch->max =3D MAX_SWAP_GATHER_BATCH; > >>>>>> + > >>>>>> + tlb->swp->active->next =3D swap_batch; > >>>>>> + tlb->swp->active =3D swap_batch; > >>>>>> + tlb->swp->batch_count++; > >>>>>> + return true; > >>>>>> +free: > >>>>>> + /* batch move to wq */ > >>>>>> + __tlb_swap_gather_queuework(tlb, false); > >>>>>> + return false; > >>>>>> +} > >>>>>> + > >>>>>> +/** > >>>>>> + * __tlb_remove_swap_entries - the swap entries in exiting proces= s are > >>>>>> + * isolated, batch cached in struct mmu_swap_batch. > >>>>>> + * @tlb: the current mmu_gather > >>>>>> + * @entry: swp_entry to be isolated and cached > >>>>>> + * @nr: the number of consecutive entries starting from entry par= ameter. > >>>>>> + */ > >>>>>> +bool __tlb_remove_swap_entries(struct mmu_gather *tlb, > >>>>>> + swp_entry_t entry, int nr) > >>>>>> +{ > >>>>>> + struct mmu_swap_batch *swap_batch; > >>>>>> + unsigned long flags =3D 0; > >>>>>> + bool ret =3D false; > >>>>>> + > >>>>>> + if (tlb->swp_disable) > >>>>>> + return ret; > >>>>>> + > >>>>>> + if (!tlb->swp_freeable && !tlb->swp_freeing) > >>>>>> + return ret; > >>>>>> + > >>>>>> + if (tlb->swp_freeable) { > >>>>>> + if (atomic_read(&nr_exiting_processes) < > >>>>>> + NR_MIN_EXITING_PROCESSES) > >>>>>> + return ret; > >>>>>> + /* > >>>>>> + * If the current number of exiting processes > >>>>>> + * is >=3D NR_MIN_EXITING_PROCESSES, the exiting > >>>>>> + * process with swp_freeable state will enter > >>>>>> + * swp_freeing state to start releasing its > >>>>>> + * remaining swap entries by the asynchronous > >>>>>> + * kworker. 
> >>>>>> + */ > >>>>>> + tlb->swp_freeable =3D 0; > >>>>>> + tlb->swp_freeing =3D 1; > >>>>>> + } > >>>>>> + > >>>>>> + VM_BUG_ON(tlb->swp_freeable || !tlb->swp_freeing); > >>>>>> + if (!tlb->swp && !__tlb_swap_gather_init(tlb)) > >>>>>> + return ret; > >>>>>> + > >>>>>> + swap_batch =3D tlb->swp->active; > >>>>>> + if (unlikely(swap_batch->nr >=3D swap_batch->max - 1)) { > >>>>>> + __tlb_swap_gather_queuework(tlb, false); > >>>>>> + return ret; > >>>>>> + } > >>>>>> + > >>>>>> + if (likely(nr =3D=3D 1)) { > >>>>>> + swap_batch->encoded_entrys[swap_batch->nr++] =3D e= ncode_swpentry(entry, flags); > >>>>>> + } else { > >>>>>> + flags |=3D ENCODED_SWPENTRY_BIT_NR_ENTRYS_NEXT; > >>>>>> + swap_batch->encoded_entrys[swap_batch->nr++] =3D e= ncode_swpentry(entry, flags); > >>>>>> + swap_batch->encoded_entrys[swap_batch->nr++] =3D e= ncode_nr_swpentrys(nr); > >>>>>> + } > >>>>>> + ret =3D true; > >>>>>> + > >>>>>> + if (swap_batch->nr >=3D swap_batch->max - 1) { > >>>>>> + if (!__tlb_swap_next_batch(tlb)) > >>>>>> + goto exit; > >>>>>> + swap_batch =3D tlb->swp->active; > >>>>>> + } > >>>>>> + VM_BUG_ON(swap_batch->nr > swap_batch->max - 1); > >>>>>> +exit: > >>>>>> + return ret; > >>>>>> +} > >>>>>> + > >>>>>> +static void __tlb_batch_swap_finish(struct mmu_gather *tlb) > >>>>>> +{ > >>>>>> + if (tlb->swp_disable) > >>>>>> + return; > >>>>>> + > >>>>>> + if (!tlb->swp_freeable && !tlb->swp_freeing) > >>>>>> + return; > >>>>>> + > >>>>>> + if (tlb->swp_freeable) { > >>>>>> + tlb->swp_freeable =3D 0; > >>>>>> + VM_BUG_ON(tlb->swp_freeing); > >>>>>> + goto exit; > >>>>>> + } > >>>>>> + tlb->swp_freeing =3D 0; > >>>>>> + if (unlikely(!tlb->swp)) > >>>>>> + goto exit; > >>>>>> + > >>>>>> + __tlb_swap_gather_queuework(tlb, true); > >>>>>> +exit: > >>>>>> + atomic_dec(&nr_exiting_processes); > >>>>>> +} > >>>>>> > >>>>>> static bool tlb_next_batch(struct mmu_gather *tlb) > >>>>>> { > >>>>>> @@ -386,6 +678,9 @@ static void __tlb_gather_mmu(struct mmu_gather= *tlb, struct mm_struct *mm, > >>>>>> tlb->local.max =3D ARRAY_SIZE(tlb->__pages); > >>>>>> tlb->active =3D &tlb->local; > >>>>>> tlb->batch_count =3D 0; > >>>>>> + > >>>>>> + tlb->swp_disable =3D 1; > >>>>>> + __tlb_swap_gather_mmu(tlb); > >>>>>> #endif > >>>>>> tlb->delayed_rmap =3D 0; > >>>>>> > >>>>>> @@ -466,6 +761,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb) > >>>>>> > >>>>>> #ifndef CONFIG_MMU_GATHER_NO_GATHER > >>>>>> tlb_batch_list_free(tlb); > >>>>>> + __tlb_batch_swap_finish(tlb); > >>>>>> #endif > >>>>>> dec_tlb_flush_pending(tlb->mm); > >>>>>> } > >>>>>> -- > >>>>>> 2.39.0 > >>>>>> > >>> Thanks > >>> Barry > >> Thanks > >> Zhiguo > >> > > Thanks > > Barry > Thanks Barry
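
As a rough illustration of the per-NUMA-node alternative discussed above
(not part of Zhiguo's patch), the single unbound "smfree_wq" workqueue could
become one workqueue per node, with each exiting task queueing its gathered
batch on its local node. This is only a sketch: the names swapfree_wqs,
swap_async_free_wq_init() and swap_gather_queue() are hypothetical, and it
assumes the struct mmu_swap_gather (with its free_work member) introduced by
the patch.

#include <linux/workqueue.h>
#include <linux/nodemask.h>
#include <linux/topology.h>

/* One unbound, ordered workqueue per NUMA node instead of a single global one. */
static struct workqueue_struct *swapfree_wqs[MAX_NUMNODES];

static int __init swap_async_free_wq_init(void)
{
	int nid;

	for_each_node(nid) {
		swapfree_wqs[nid] = alloc_workqueue("smfree_wq_%d",
				WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
				1, nid);
		/* cleanup of already-created workqueues omitted for brevity */
		if (!swapfree_wqs[nid])
			return -ENOMEM;
	}
	return 0;
}

/*
 * Queue a gathered batch on the workqueue of the exiting task's node, so the
 * freeing work stays local to the CPUs that populated the batch.
 */
static void swap_gather_queue(struct mmu_swap_gather *swap_gather)
{
	queue_work(swapfree_wqs[numa_node_id()], &swap_gather->free_work);
}

Whether per-node workers actually help would still need the profiling data
discussed above; on single-node phone SoCs this degenerates to the single
worker in the posted patch.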