From: Barry Song <21cnbao@gmail.com>
Date: Wed, 13 Mar 2024 23:37:19 +1300
Subject: Re: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
To: Ryan Roberts
Cc: Lance Yang, Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Chris Li, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com> <20240311150058.1122862-7-ryan.roberts@arm.com> <00a3ba1d-98e1-409b-ae6e-7fbcbdcd74d5@arm.com>
On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts wrote:
>
> On 13/03/2024 09:16, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage.
> >>>> This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc.) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the subsequent folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would likely have led to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficient than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - it's a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++---------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>         LIST_HEAD(folio_list);
> >>>>         bool pageout_anon_only_filter;
> >>>>         unsigned int batch_count = 0;
> >>>> +       int nr;
> >>>>
> >>>>         if (fatal_signal_pending(current))
> >>>>                 return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                 return 0;
> >>>>         flush_tlb_batched_pending(mm);
> >>>>         arch_enter_lazy_mmu_mode();
> >>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +               nr = 1;
> >>>>                 ptent = ptep_get(pte);
> >>>>
> >>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                         continue;
> >>>>
> >>>>                 /*
> >>>> -                * Creating a THP page is expensive so split it only if we
> >>>> -                * are sure it's worth. Split it if we are only owner.
> >>>> +                * If we encounter a large folio, only split it if it is not
> >>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>> +                * fail to split a folio, leave it in place and advance to the
> >>>> +                * next pte in the range.
> >>>>                  */
> >>>>                 if (folio_test_large(folio)) {
> >>>> -                       int err;
> >>>> -
> >>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>> -                               break;
> >>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                               break;
> >>>> -                       if (!folio_trylock(folio))
> >>>> -                               break;
> >>>> -                       folio_get(folio);
> >>>> -                       arch_leave_lazy_mmu_mode();
> >>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>> -                       start_pte = NULL;
> >>>> -                       err = split_folio(folio);
> >>>> -                       folio_unlock(folio);
> >>>> -                       folio_put(folio);
> >>>> -                       if (err)
> >>>> -                               break;
> >>>> -                       start_pte = pte =
> >>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                       if (!start_pte)
> >>>> -                               break;
> >>>> -                       arch_enter_lazy_mmu_mode();
> >>>> -                       pte--;
> >>>> -                       addr -= PAGE_SIZE;
> >>>> -                       continue;
> >>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                            fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >> if (pte_pfn(pte) == folio_pfn(folio)
> >
> > What about
> >
> > "if (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >
> > just to account for cases where the user's end address falls within
> > the middle of a large folio?
>
> Yes, even better. I'll add this for the next version.
>
> >
> > BTW, another minor issue is here:
> >
> >         if (++batch_count == SWAP_CLUSTER_MAX) {
> >                 batch_count = 0;
> >                 if (need_resched()) {
> >                         arch_leave_lazy_mmu_mode();
> >                         pte_unmap_unlock(start_pte, ptl);
> >                         cond_resched();
> >                         goto restart;
> >                 }
> >         }
> >
> > We increase batch_count by 1 for a batch of nr ptes, so we hold the PTL
> > longer than in the small-folio case, where we used to increase it by 1
> > for each pte. Does it matter?
>
> I thought about that, but the vast majority of the work is per-folio, not
> per-pte. So I concluded it would be best to continue to increment per-folio.

Okay. The original patch, commit b2f557a21bc8 ("mm/madvise: add
cond_resched() in madvise_cold_or_pageout_pte_range()"), primarily
addressed real-time wake-up latency. MADV_PAGEOUT and MADV_COLD are much
less critical than other scenarios, such as do_anon_page or do_swap_page,
which genuinely need the PTL to make progress. Therefore, an approach
that relatively aggressively releases the PTL seems to neither harm
MADV_PAGEOUT/COLD nor disadvantage others.

We are slightly increasing the time the PTL is held, because iterating in
folio_pte_batch() can take longer than the small-folio case, which does
not require it. However, compared to operations like folio_isolate_lru()
and folio_deactivate(), this increase seems negligible. Recently, we have
actually removed ptep_test_and_clear_young() for MADV_PAGEOUT, which
should also benefit real-time scenarios. Nonetheless, there is a small
risk with large folios, such as 1 MiB mTHP, where we may need to loop 256
times in folio_pte_batch().

I would vote for increasing batch_count by nr, or maybe by
max(log2(nr), 1), rather than by 1, for two reasons:

1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are improving
   them by reducing the time taken to put the same number of pages onto
   the reclaim list.
2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
   genuinely require the PTL to progress. Moreover, the majority of the
   time spent on PAGEOUT is actually in reclaim_pages().

> >
> >> nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                      fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>
> >>>> +
> >>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>> +                               int err;
> >>>> +
> >>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>> +                                       continue;
> >>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                       continue;
> >>>> +                               if (!folio_trylock(folio))
> >>>> +                                       continue;
> >>>> +                               folio_get(folio);
> >>>> +                               arch_leave_lazy_mmu_mode();
> >>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>> +                               start_pte = NULL;
> >>>> +                               err = split_folio(folio);
> >>>> +                               folio_unlock(folio);
> >>>> +                               folio_put(folio);
> >>>> +                               if (err)
> >>>> +                                       continue;
> >>>> +                               start_pte = pte =
> >>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                               if (!start_pte)
> >>>> +                                       break;
> >>>> +                               arch_enter_lazy_mmu_mode();
> >>>> +                               nr = 0;
> >>>> +                               continue;
> >>>> +                       }
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>>                  * Do not interfere with other mappings of this folio and
> >>>> -                * non-LRU folio.
> >>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>> +                * number of pages, it must be exclusive.
> >>>>                  */
> >>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +               if (!folio_test_lru(folio) ||
> >>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                         continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                         continue;
> >>>>
> >>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -               if (!pageout && pte_young(ptent)) {
> >>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                       tlb->fullmm);
> >>>> -                       ptent = pte_mkold(ptent);
> >>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +               if (!pageout) {
> >>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +                       }
> >>>
> >>> This looks so smart. If it is not pageout, we have already advanced pte
> >>> and addr here, so nr is 0 and we don't need to advance again in
> >>>
> >>>     for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> Otherwise, nr won't be 0, so we will advance addr and pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>>
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song
> >>
> >> Thanks!

Thanks
Barry