From: Barry Song <21cnbao@gmail.com>
Date: Thu, 30 Nov 2023 02:00:48 +1300
Subject: Re: [PATCH v2 14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com, ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com, glider@google.com, james.morse@arm.com, jhubbard@nvidia.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org, oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com, vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org, willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
In-Reply-To: <34da1e06-74da-4e45-b0b5-9c93d64eb64e@arm.com>
References: <20231115163018.1303287-15-ryan.roberts@arm.com> <20231128081742.39204-1-v-songbaohua@oppo.com> <207de995-6d48-41ea-8373-2f9caad9b9c3@arm.com> <34da1e06-74da-4e45-b0b5-9c93d64eb64e@arm.com>

On Thu, Nov 30, 2023 at 1:43 AM Ryan Roberts wrote:
>
> On 28/11/2023 20:23, Barry Song wrote:
> > On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts wrote:
> >>
> >> On 28/11/2023 08:17, Barry Song wrote:
> >>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >>>> +                                      unsigned long addr, pte_t *ptep)
> >>>> +{
> >>>> +        /*
> >>>> +         * When doing a full address space teardown, we can avoid unfolding the
> >>>> +         * contiguous range, and therefore avoid the associated tlbi. Instead,
> >>>> +         * just get and clear the pte. The caller is promising to call us for
> >>>> +         * every pte, so every pte in the range will be cleared by the time the
> >>>> +         * tlbi is issued.
> >>>> +         *
> >>>> +         * This approach is not perfect though, as for the duration between
> >>>> +         * returning from the first call to ptep_get_and_clear_full() and making
> >>>> +         * the final call, the contpte block is in an intermediate state, where
> >>>> +         * some ptes are cleared and others are still set with the PTE_CONT bit.
> >>>> +         * If any other APIs are called for the ptes in the contpte block during
> >>>> +         * that time, we have to be very careful. The core code currently
> >>>> +         * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >>>> +         * ptep_get() must be careful to ignore the cleared entries when
> >>>> +         * accumulating the access and dirty bits - the same goes for
> >>>> +         * ptep_get_lockless(). The only other calls we might reasonably expect
> >>>> +         * are to set markers in the previously cleared ptes. (We shouldn't see
> >>>> +         * valid entries being set until after the tlbi, at which point we are
> >>>> +         * no longer in the intermediate state). Since markers are not valid,
> >>>> +         * this is safe; set_ptes() will see the old, invalid entry and will not
> >>>> +         * attempt to unfold. And the new pte is also invalid so it won't
> >>>> +         * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >>>> +         *
> >>>> +         * The last remaining issue is returning the access/dirty bits. That
> >>>> +         * info could be present in any of the ptes in the contpte block.
> >>>> +         * ptep_get() will gather those bits from across the contpte block. We
> >>>> +         * don't bother doing that here, because we know that the information is
> >>>> +         * used by the core-mm to mark the underlying folio as accessed/dirty.
> >>>> +         * And since the same folio must be underpinning the whole block (that
> >>>> +         * was a requirement for folding in the first place), that information
> >>>> +         * will make it to the folio eventually once all the ptes have been
> >>>> +         * cleared. This approach means we don't have to play games with
> >>>> +         * accumulating and storing the bits. It does mean that any interleaved
> >>>> +         * calls to ptep_get() may lack correct access/dirty information if we
> >>>> +         * have already cleared the pte that happened to store it. The core code
> >>>> +         * does not rely on this though.
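
IIUC, the "gather those bits from across the contpte block" step would look
roughly like the below. this is my own illustrative, untested sketch (not
code lifted from your series); the point it shows is that a cleared pte is
zero, so it reports neither young nor dirty, and the cleared entries drop
out of the accumulation naturally:

        /* ptep assumed aligned to the start of the CONT_PTES block */
        static pte_t contpte_gather_access_dirty(pte_t *ptep, pte_t orig_pte)
        {
                int i;

                for (i = 0; i < CONT_PTES; i++, ptep++) {
                        pte_t pte = __ptep_get(ptep);

                        /* zero (cleared) ptes fail both checks */
                        if (pte_dirty(pte))
                                orig_pte = pte_mkdirty(orig_pte);
                        if (pte_young(pte))
                                orig_pte = pte_mkyoung(orig_pte);
                }

                return orig_pte;
        }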
> >>>
> >>> even without any other threads running and touching those PTEs, this won't
> >>> survive on some hardware. we expose inconsistent CONTPTEs to hardware, this
> >>> might result
> >>
> >> No, that's not the case; if you read the Arm ARM, the page table is only
> >> considered "misprogrammed" when *valid* entries within the same contpte block
> >> have different values for the contiguous bit. We are clearing the ptes to zero
> >> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> >> (either due to an explicit tlbi, as you point out below, or due to a concurrent
> >> TLB miss which selects our entry for removal to make space for the new incoming
> >> entry), then when it gets an access request for an address in our partially
> >> cleared contpte block, the address will either be:
> >>
> >> A) an address for a pte entry we have already cleared, so it's invalid and it
> >> will fault (and get serialized behind the PTL).
> >>
> >> or
> >>
> >> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> >> entry for the contpte block. But that's ok because the memory still exists
> >> because we haven't yet finished clearing the page table and have not yet issued
> >> the final tlbi.
> >>
> >>> in crashed firmware even in trustzone, strange & unknown faults to trustzone
> >>> we have seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a
> >>> part of PTEs with dropped CONT but still some other PTEs have CONT, we make
> >>> hardware totally confused.
> >>
> >> I suspect this is because in your case you are "misprogramming" the contpte
> >> block; there are *valid* pte entries within the block that disagree about the
> >> contiguous bit or about various other fields. In this case some HW TLB designs
> >> can do weird things. I suspect in your case that's resulting in accessing bad
> >> memory space and causing an SError, which is trapped by EL3, and the FW is
> >> probably just panicking at that point.
> >
> > you are probably right. as we met the SError, we became very very cautious.
> > so anytime when we flush tlb for a CONTPTE, we strictly do it by:
> > 1. set all 16 ptes to zero
> > 2. flush the whole 16 ptes
>
> But my point is that this sequence doesn't guarantee that the TLB doesn't read
> the page table half way through the SW clearing the 16 entries; a TLB entry can
> be ejected for reasons other than just issuing a TLBI. So in that case these 2
> flows can be equivalent. It's the fact that we are unsetting the valid bit when
> clearing each pte that guarantees this to be safe.
>
> > in your case, it can be:
> > 1. set pte0 to zero
> > 2. flush pte0
> >
> > TBH, i have never tried this. but it might be safe according to your
> > description.
> >
> >>> zap_pte_range() has a force_flush when tlbbatch is full:
> >>>
> >>>                 if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> >>>                         force_flush = 1;
> >>>                         addr += PAGE_SIZE;
> >>>                         break;
> >>>                 }
> >>>
> >>> this means you can expose partial tlbi/flush directly to hardware while some
> >>> other PTEs are still CONT.
> >>
> >> Yes, but that's also possible even if we have a tight loop that clears down
> >> the contpte block; there could still be another core that issues a tlbi while
> >> you're halfway through that loop, or the HW could happen to evict due to TLB
> >> pressure at any time. The point is, it's safe if you are clearing the pte to
> >> an *invalid* entry.
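
so IIUC the invariant is: every store in the teardown writes an *invalid*
(zero) pte, so no matter when the hardware walker looks, it can never see
two *valid* entries disagreeing on PTE_CONT. roughly (my own untested
sketch, simplified; mm/vma/ptep/addr/start as in the surrounding loop):

        /* clear one contpte block; each write makes that entry invalid */
        for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
                __ptep_get_and_clear(mm, addr, ptep);

        /*
         * A racing hardware walk either faults on an already-zeroed entry
         * (and serializes behind the PTL) or re-forms a translation from a
         * still-valid entry, whose memory remains present until here:
         */
        flush_tlb_range(vma, start, start + CONT_PTES * PAGE_SIZE);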
> >>
> >>> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> >>> on fullmm; as long as the zap range covers a large folio, we can flush tlbi
> >>> for those CONTPTEs all together in your contpte_ptep_get_and_clear_full()
> >>> rather than clearing one PTE.
> >>>
> >>> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the
> >>> end of the large folio:
> >>>
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>                 if (pte_cont(ptent)) {
> >>>                         unsigned long next = pte_cont_addr_end(addr, end);
> >>>
> >>>                         if (next - addr != HPAGE_CONT_PTE_SIZE) {
> >>>                                 __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> >>>                                 /*
> >>>                                  * After splitting cont-pte
> >>>                                  * we need to process pte again.
> >>>                                  */
> >>>                                 goto again_pte;
> >>>                         } else {
> >>>                                 cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >>>
> >>>                                 tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> >>>                                 if (unlikely(!page))
> >>>                                         continue;
> >>>
> >>>                                 if (is_huge_zero_page(page)) {
> >>>                                         tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                                         goto cont_next;
> >>>                                 }
> >>>
> >>>                                 rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> >>>                                 page_remove_rmap(page, true);
> >>>                                 if (unlikely(page_mapcount(page) < 0))
> >>>                                         print_bad_pte(vma, addr, ptent, page);
> >>>
> >>>                                 tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                         }
> >>> cont_next:
> >>>                         /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> >>>                         pte += (next - PAGE_SIZE - (addr & PAGE_MASK)) / PAGE_SIZE;
> >>>                         addr = next - PAGE_SIZE;
> >>>                         continue;
> >>>                 }
> >>> #endif
> >>>
> >>> this is our "full" counterpart, which clear_flushes CONT_PTES pages directly,
> >>> and it never requires tlb->fullmm at all.
> >>
> >> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> >> and it is special-casing them at this level. I'm trying to avoid that.
> >
> > I am thinking we can even do this while we don't expose CONTPTE.
> > if zap_pte_range meets a large folio and the zap range covers the whole
> > folio, we can flush all ptes in this folio and jump to the end of this
> > folio? i mean:
> >
> > if (folio head && range_end > folio_end) {
> >         nr = folio_nr_pages(folio);
> >         full_flush_nr_ptes();
> >         pte += nr - 1;
> >         addr += (nr - 1) * PAGE_SIZE;
> > }
>
> Just because you found a pte that maps a page from a large folio, that doesn't
> mean that all pages from the folio are mapped, and it doesn't mean they are
> mapped contiguously. We have to deal with partial munmap(), partial mremap(),
> etc. We could split in these cases (and in future it might be sensible to try),
> but that can fail (due to GUP). So we still have to handle the corner case.

maybe we can check the ptes and find they are all mapped (pte_present), then
do a flush_range -- something like the rough sketch below. This is actually
the most common case; the majority of madv(DONTNEED) will benefit from it.

if the condition is false, we fall back to your current code.

zap_pte_range always sets present ptes to 0, but a swap entry is really
quite different.
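
something like this (rough, untested sketch; the helper name and placement
are invented for illustration, and rmap/rss accounting is left out). it
assumes pte/addr point at the pte mapping the folio's first page:

        /*
         * Return true if the whole large folio is mapped contiguously at
         * addr, i.e. every pte is present and points at the expected pfn.
         * Swap/migration entries fail the check and take the per-pte path.
         */
        static bool folio_range_fully_mapped(struct folio *folio, pte_t *pte,
                                             unsigned long addr, unsigned long end)
        {
                unsigned int i, nr = folio_nr_pages(folio);
                unsigned long pfn = folio_pfn(folio);

                if (end - addr < (unsigned long)nr * PAGE_SIZE)
                        return false;

                for (i = 0; i < nr; i++) {
                        pte_t ptent = ptep_get(pte + i);

                        if (!pte_present(ptent) || pte_pfn(ptent) != pfn + i)
                                return false;
                }

                return true;
        }

when this returns true, zap_pte_range() could clear all nr ptes, issue one
flush for the whole range, and jump pte/addr to the end of the folio;
otherwise nothing changes.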
>
> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
> for ptep_set_wrprotects(). And I think this would be an improvement.
>
> The reason I haven't done that so far is because ptep_get_and_clear() returns
> the pte value when it was cleared, and that's hard to do if batching due to the
> storage requirement. But perhaps you could just return the logical OR of the
> dirty and young bits across all ptes in the batch. The caller should be able to
> reconstitute the rest if it needs it?
>
> What do you think?
>
> > zap_pte_range is the most frequent behaviour from the userspace libc heap,
> > as i explained before. libc can call madvise(DONTNEED) the most often.
> > It is crucial to performance.
> >
> > and this way can also help drop your full version by moving to fully
> > flushing whole large folios? and we don't need to depend on fullmm
> > any more?
> >
> >> I don't think there is any correctness issue here. But there is a problem
> >> with fragility, as raised by Alistair. I have some ideas on potentially how
> >> to solve that. I'm going to try to work on it this afternoon and will post
> >> if I get some confidence that it is a real solution.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >>>                                                               unsigned long addr,
> >>>                                                               pte_t *ptep,
> >>>                                                               bool flush)
> >>> {
> >>>         pte_t orig_pte = ptep_get(ptep);
> >>>
> >>>         CHP_BUG_ON(!pte_cont(orig_pte));
> >>>         CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >>>         CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >>>
> >>>         return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >>>
> >>>> + */
> >>>> +
> >>>> +       return __ptep_get_and_clear(mm, addr, ptep);
> >>>> +}
> >>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >>>> +
> >>>
>

Thanks
Barry
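
P.S. for concreteness, the batched ptep_get_and_clear() you describe might
have a shape like this. very rough and untested; the name, signature and
placement are just for illustration, and the OR-folding of young/dirty is
the idea you outlined above:

        /*
         * Clear nr ptes and return the first pte value with the access and
         * dirty bits ORed in from the whole batch. Callers needing more
         * than pfn/prot plus the collapsed young/dirty state would have to
         * reconstitute it themselves.
         */
        static inline pte_t ptep_get_and_clear_batch(struct mm_struct *mm,
                                                     unsigned long addr,
                                                     pte_t *ptep, unsigned int nr)
        {
                pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
                unsigned int i;

                for (i = 1; i < nr; i++) {
                        pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
                                                         ptep + i);

                        if (pte_dirty(pte))
                                orig = pte_mkdirty(orig);
                        if (pte_young(pte))
                                orig = pte_mkyoung(orig);
                }

                return orig;
        }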