From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20231115163018.1303287-2-ryan.roberts@arm.com>
 <20231127055414.9015-1-v-songbaohua@oppo.com>
 <755343a1-ce94-4d38-8317-0925e2dae3bc@arm.com>
 <02d85331-eaa0-4d76-a3d6-ea5eb18b683c@arm.com>
In-Reply-To:
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 30 Nov 2023 13:34:13 +1300
Message-ID:
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
To: Ryan Roberts
Cc: akpm@linux-foundation.org, andreyknvl@gmail.com, anshuman.khandual@arm.com,
 ardb@kernel.org, catalin.marinas@arm.com, david@redhat.com, dvyukov@google.com,
 glider@google.com, james.morse@arm.com, jhubbard@nvidia.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, mark.rutland@arm.com, maz@kernel.org,
 oliver.upton@linux.dev, ryabinin.a.a@gmail.com, suzuki.poulose@arm.com,
 vincenzo.frascino@arm.com, wangkefeng.wang@huawei.com, will@kernel.org,
 willy@infradead.org, yuzenghui@huawei.com, yuzhao@google.com, ziy@nvidia.com
Content-Type: text/plain; charset="UTF-8"

On Thu, Nov 30, 2023 at 3:07 AM Ryan Roberts wrote:
>
> On 29/11/2023 13:09, Barry Song wrote:
> > On Thu, Nov 30, 2023 at 1:29 AM Ryan Roberts wrote:
> >>
> >> On 28/11/2023 19:00, Barry Song wrote:
> >>> On Wed, Nov 29, 2023 at 12:00 AM Ryan Roberts wrote:
> >>>>
> >>>> On 28/11/2023 00:11, Barry Song wrote:
> >>>>> On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts wrote:
> >>>>>>
> >>>>>> On 27/11/2023 05:54, Barry Song wrote:
> >>>>>>>> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>>>> +                  pte_t *dst_pte, pte_t *src_pte,
> >>>>>>>> +                  unsigned long addr, unsigned long end,
> >>>>>>>> +                  int *rss, struct folio **prealloc)
> >>>>>>>>  {
> >>>>>>>>         struct mm_struct *src_mm = src_vma->vm_mm;
> >>>>>>>>         unsigned long vm_flags = src_vma->vm_flags;
> >>>>>>>>         pte_t pte = ptep_get(src_pte);
> >>>>>>>>         struct page *page;
> >>>>>>>>         struct folio *folio;
> >>>>>>>> +       int nr = 1;
> >>>>>>>> +       bool anon;
> >>>>>>>> +       bool any_dirty = pte_dirty(pte);
> >>>>>>>> +       int i;
> >>>>>>>>
> >>>>>>>>         page = vm_normal_page(src_vma, addr, pte);
> >>>>>>>> -       if (page)
> >>>>>>>> +       if (page) {
> >>>>>>>>                 folio = page_folio(page);
> >>>>>>>> -       if (page && folio_test_anon(folio)) {
> >>>>>>>> -               /*
> >>>>>>>> -                * If this page may have been pinned by the parent process,
> >>>>>>>> -                * copy the page immediately for the child so that we'll always
> >>>>>>>> -                * guarantee the pinned page won't be randomly replaced in the
> >>>>>>>> -                * future.
> >>>>>>>> -                */
> >>>>>>>> -               folio_get(folio);
> >>>>>>>> -               if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >>>>>>>> -                       /* Page may be pinned, we have to copy. */
> >>>>>>>> -                       folio_put(folio);
> >>>>>>>> -                       return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >>>>>>>> -                                                addr, rss, prealloc, page);
> >>>>>>>> +               anon = folio_test_anon(folio);
> >>>>>>>> +               nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >>>>>>>> +                                               end, pte, &any_dirty);
> >>>>>>>
> >>>>>>> in case we have a large folio with 16 CONTPTE basepages, and userspace
> >>>>>>> do madvise(addr + 4KB * 5, DONTNEED);
> >>>>>>
> >>>>>> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> >>>>>> page0~page4 and page6~page15?
> >>>>>>
> >>>>>>>
> >>>>>>> thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> >>>>>>> will return 15. in this case, we should copy page0~page3 and page5~page15.
> >>>>>>
> >>>>>> No, I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> >>>>>> not how it's intended to work. The function is scanning forwards from the current
> >>>>>> pte until it finds the first pte that does not fit in the batch - either because
> >>>>>> it maps a PFN that is not contiguous, or because the permissions are different
> >>>>>> (although this is being relaxed a bit; see conversation with DavidH against this
> >>>>>> same patch).
> >>>>>>
> >>>>>> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5
> >>>>>> (page0~page4), then the next time through the loop we will go through the
> >>>>>> !present path and process the single swap marker. Then the 3rd time through the
> >>>>>> loop folio_nr_pages_cont_mapped() will return 10.
> >>>>>
> >>>>> one case we have met by running hundreds of real phones is as below,
> >>>>>
> >>>>>
> >>>>> static int
> >>>>> copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >>>>>                pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> >>>>>                unsigned long end)
> >>>>> {
> >>>>> ...
> >>>>>         dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
> >>>>>         if (!dst_pte) {
> >>>>>                 ret = -ENOMEM;
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> >>>>>         if (!src_pte) {
> >>>>>                 pte_unmap_unlock(dst_pte, dst_ptl);
> >>>>>                 /* ret == 0 */
> >>>>>                 goto out;
> >>>>>         }
> >>>>>         spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> >>>>>         orig_src_pte = src_pte;
> >>>>>         orig_dst_pte = dst_pte;
> >>>>>         arch_enter_lazy_mmu_mode();
> >>>>>
> >>>>>         do {
> >>>>>                 /*
> >>>>>                  * We are holding two locks at this point - either of them
> >>>>>                  * could generate latencies in another task on another CPU.
> >>>>>                  */
> >>>>>                 if (progress >= 32) {
> >>>>>                         progress = 0;
> >>>>>                         if (need_resched() ||
> >>>>>                             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                                 break;
> >>>>>                 }
> >>>>>                 ptent = ptep_get(src_pte);
> >>>>>                 if (pte_none(ptent)) {
> >>>>>                         progress++;
> >>>>>                         continue;
> >>>>>                 }
> >>>>>
> >>>>> the above iteration can break when progress >= 32. for example, at the
> >>>>> beginning, if all PTEs are none, we break when progress >= 32, and we might
> >>>>> break while we are at the 8th pte of 16 PTEs which could become CONTPTE
> >>>>> after we release the PTL.
> >>>>>
> >>>>> since we are releasing PTLs, next time when we get the PTL, those pte_none()
> >>>>> entries might have become pte_cont(). then are you going to copy CONTPTE from
> >>>>> the 8th pte, thus immediately breaking the consistent-CONTPTE rule of the
> >>>>> hardware?
> >>>>>
> >>>>> pte0 - pte_none
> >>>>> pte1 - pte_none
> >>>>> ...
> >>>>> pte7 - pte_none
> >>>>>
> >>>>> pte8 - pte_cont
> >>>>> ...
> >>>>> pte15 - pte_cont
> >>>>>
> >>>>> so we did some modification to avoid a break in the middle of PTEs
> >>>>> which can potentially become CONTPTE.
> >>>>>
> >>>>> do {
> >>>>>         /*
> >>>>>          * We are holding two locks at this point - either of them
> >>>>>          * could generate latencies in another task on another CPU.
> >>>>>          */
> >>>>>         if (progress >= 32) {
> >>>>>                 progress = 0;
> >>>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>>>                 /*
> >>>>>                  * XXX: don't release the ptl at an unaligned address, as
> >>>>>                  * cont_pte might form while the ptl is released; this
> >>>>>                  * causes a double-map
> >>>>>                  */
> >>>>>                 if (!vma_is_chp_anonymous(src_vma) ||
> >>>>>                     (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> >>>>>                      HPAGE_CONT_PTE_SIZE)))
> >>>>> #endif
> >>>>>                 if (need_resched() ||
> >>>>>                     spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>>>                         break;
> >>>>>         }
> >>>>>
> >>>>> We could only reproduce the above issue by running thousands of phones.
> >>>>>
> >>>>> Does your code survive this problem?
> >>>>
> >>>> Yes, I'm confident my code is safe against this; as I said before, the CONT_PTE
> >>>> bit is not blindly "copied" from parent to child pte. As far as the core-mm is
> >>>> concerned, there is no CONT_PTE bit; they are just regular PTEs. So the code
> >>>> will see some pte_none() entries followed by some pte_present() entries. And
> >>>> when calling set_ptes() on the child, the arch code will evaluate the current
> >>>> state of the pgtable along with the new set_ptes() request and determine where
> >>>> it should insert the CONT_PTE bit.
> >>>
> >>> yep, I have read it very carefully and think your code is safe here. The
> >>> only problem is that the code can randomly unfold the parent process's
> >>> CONTPTE while setting wrprotect in the middle of a large folio, while it
> >>> actually should keep the CONT bit, as all PTEs can still be consistent
> >>> if we set the protection from the 1st PTE.
> >>>
> >>> while A forks B, progress >= 32 might interrupt in the middle of a
> >>> new CONTPTE folio which is forming; as we have to set wrprotect on parent A,
> >>> this parent immediately loses the CONT bit. this is sad. but I can't find a
> >>> good way to resolve it unless CONT is exposed to the mm-core. any idea on
> >>> this?
> >>
> >> No, this is not the case; copy_present_ptes() will copy as many ptes as are
> >> physically contiguous and belong to the same folio (which usually means "the
> >> whole folio" - the only time it doesn't is when we hit the end of the vma). We
> >> will then return to the main loop and move forwards by the number of ptes that
> >> were serviced, including:
> >
> > I probably have failed to describe my question. I'd like to give a
> > concrete example:
> >
> > 1. process A forks B
> > 2. at the beginning, address ~ address + 64KB has pte_none PTEs
> > 3. we scan the 5th pte at address + 5 * 4KB, progress becomes 32, we
> >    break and release the PTLs
> > 4. another page fault in process A gets the PTL and sets
> >    address ~ address + 64KB to pte_cont
> > 5. we get the PTL again and arrive at the 5th pte
> > 6. we set wrprotect on the 5th, 6th, 7th ... 15th ptes; in this case, we
> >    have to unfold parent A, and child B also gets unfolded PTEs, unless
> >    our loop can go back to the 0th pte
> >
> > technically, A should be able to keep CONTPTE, but because of the
> > implementation of the code, it can't. That is the sadness. but it is
> > obviously not your fault.
> >
> > no worries. This is not happening very often. but I just want to make a note
> > here; maybe someday we can get back to address it.
>
> Ahh, I understand the situation now, sorry for being slow!
>
> I expect this to be a very rare situation anyway since (4) suggests process A
> has another thread, and forking is not encouraged for multithreaded programs. In
> fact the fork man page says:
>
>   After a fork() in a multithreaded program, the child can safely call only
>   async-signal-safe functions (see signal-safety(7)) until such time as it calls
>   execve(2).
>
> So in this case, we are about to completely repaint the child's address space
> with execve() anyway.
>
> So it's just the racing parent that loses the CONT_PTE bit. I expect this to be
> rare. I'm not sure there is much we can do to solve it though, because
> unlike with your solution, we have to cater for multiple sizes, so there is no
> obvious border until we get to PMD size, and I'm guessing that's going to be a
> problem for latency spikes.

right. I don't think this can be a big problem. the background is that we
have a way to constantly detect and report unexpected unfold events, so we
run hundreds of phones in the lab and collect data to find out whether we
have any potential problems. we record unexpected unfolds and their reasons
in a proc file, and we monitor those data to look for potential bugs. this
">= 32" break and unfold was found in this way, and we simply addressed it
by disallowing the break at an unaligned address.

This is never an issue which can stop your patchset. but I was still glad to
share our observations with you and would like to hear if you had any idea :-)

>
>
> >>
> >>         progress += 8 * ret;
> >>
> >> That might go above 32, so we will drop the lock. But we haven't done that in
> >> the middle of a large folio. So the contpte-ness should be preserved.
> >>
> >>>
> >>> Our code[1] resolves this by only breaking at an aligned address:
> >>>
> >>> if (progress >= 32) {
> >>>         progress = 0;
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>         /*
> >>>          * XXX: don't release the ptl at an unaligned address, as
> >>>          * cont_pte might form while the ptl is released; this
> >>>          * causes a double-map
> >>>          */
> >>>         if (!vma_is_chp_anonymous(src_vma) ||
> >>>             (vma_is_chp_anonymous(src_vma) && IS_ALIGNED(addr,
> >>>              HPAGE_CONT_PTE_SIZE)))
> >>> #endif
> >>>         if (need_resched() ||
> >>>             spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> >>>                 break;
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1180

Thanks
Barry