From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20230626171430.3167004-1-ryan.roberts@arm.com>
 <20230626171430.3167004-11-ryan.roberts@arm.com> <CAHbLzkoZf=5PLrC0HNT4Owy=xM4HfA9HDjdCHC+h5RYXUwp0gw@mail.gmail.com>
 <bfd1de51-d445-5861-f69f-13d740651f8e@arm.com>
In-Reply-To: <bfd1de51-d445-5861-f69f-13d740651f8e@arm.com>
From: Yang Shi <shy828301@gmail.com>
Date: Thu, 29 Jun 2023 10:05:26 -0700
Message-ID: <CAHbLzkqKSoncVUqj7vY-Yt6H5=x-uLfDNJhRSRssGK5sKu5Ckg@mail.gmail.com>
Subject: Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, 
	"Matthew Wilcox (Oracle)" <willy@infradead.org>, 
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Yin Fengwei <fengwei.yin@intel.com>, 
	David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>, 
	Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, 
	Geert Uytterhoeven <geert@linux-m68k.org>, Christian Borntraeger <borntraeger@linux.ibm.com>, 
	Sven Schnelle <svens@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, 
	Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, 
	"H. Peter Anvin" <hpa@zytor.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
	linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org, 
	linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org, 
	linux-s390@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Thu, Jun 29, 2023 at 4:30 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/06/2023 03:13, Yang Shi wrote:
> > On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> With all of the enabler patches in place, modify the anonymous memory
> >> write allocation path so that it opportunistically attempts to allocate
> >> a large folio up to `max_anon_folio_order()` size (This value is
> >> ultimately configured by the architecture). This reduces the number of
> >> page faults, reduces the size of (e.g. LRU) lists, and generally
> >> improves performance by batching what were per-page operations into
> >> per-(large)-folio operations.
> >>
> >> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> >> `max_anon_folio_order()` always returns 0, meaning we get the existing
> >> allocation behaviour.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 file changed, 144 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index a8f7e2b28d7a..d23c44cc5092 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
> >>                 return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
> >>  }
> >>
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * As a consequence of relying on the THP infrastructure, if the system
> >> +        * does not support THP, we always fallback to order-0.
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the calculation should be considered an estimate that will
> >> +        * need to be validated under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       if (has_transparent_hugepage()) {
> >> +               order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +               for (; order > 1; order--) {
> >> +                       nr = 1 << order;
> >> +                       addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> >> +                       pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +                       /* Check vma bounds. */
> >> +                       if (addr < vma->vm_start ||
> >> +                           addr + (nr << PAGE_SHIFT) > vma->vm_end)
> >> +                               continue;
> >> +
> >> +                       /* Ptes covered by order already known to be none. */
> >> +                       if (pte + nr <= first_set)
> >> +                               break;
> >> +
> >> +                       /* Already found set pte in range covered by order. */
> >> +                       if (pte <= first_set)
> >> +                               continue;
> >> +
> >> +                       /* Need to check if all the ptes are none. */
> >> +                       ret = check_ptes_none(pte, nr);
> >> +                       if (ret == nr)
> >> +                               break;
> >> +
> >> +                       first_set = pte + ret;
> >> +               }
> >> +
> >> +               if (order == 1)
> >> +                       order = 0;
> >> +       } else
> >> +               order = 0;
> >> +
> >> +       return order;
> >> +}
> >> +
> >>  /*
> >>   * Handle write page faults for pages that can be reused in the current vma
> >>   *
> >> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         struct folio *folio;
> >>         vm_fault_t ret = 0;
> >>         pte_t entry;
> >> +       unsigned long addr;
> >> +       int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> >> +       int pgcount = BIT(order);
> >>
> >>         /* File mapping without ->vm_ops ? */
> >>         if (vma->vm_flags & VM_SHARED)
> >> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>                         return handle_userfault(vmf, VM_UFFD_MISSING);
> >>                 }
> >> -               goto setpte;
> >> +               if (uffd_wp)
> >> +                       entry = pte_mkuffd_wp(entry);
> >> +               set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +
> >> +               /* No need to invalidate - it was non-present before */
> >> +               update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +               goto unlock;
> >>         }
> >>
> >> -       /* Allocate our own private page. */
> >> +retry:
> >> +       /*
> >> +        * Estimate the folio order to allocate. We are not under the ptl here
> >> +        * so this estimate needs to be re-checked later once we have the lock.
> >> +        */
> >> +       vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> >> +       order = calc_anon_folio_order_alloc(vmf, order);
> >> +       pte_unmap(vmf->pte);
> >> +
> >> +       /* Allocate our own private folio. */
> >>         if (unlikely(anon_vma_prepare(vma)))
> >>                 goto oom;
> >> -       folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> >> +       folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
> >>         if (!folio)
> >>                 goto oom;
> >>
> >> +       /* We may have been granted less than we asked for. */
> >> +       order = folio_order(folio);
> >> +       pgcount = BIT(order);
> >> +       addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> >> +
> >>         if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> >>                 goto oom_free_page;
> >>         folio_throttle_swaprate(folio, GFP_KERNEL);
> >>
> >>         /*
> >>          * The memory barrier inside __folio_mark_uptodate makes sure that
> >> -        * preceding stores to the page contents become visible before
> >> -        * the set_pte_at() write.
> >> +        * preceding stores to the folio contents become visible before
> >> +        * the set_ptes() write.
> >>          */
> >>         __folio_mark_uptodate(folio);
> >>
> >> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>         if (vma->vm_flags & VM_WRITE)
> >>                 entry = pte_mkwrite(pte_mkdirty(entry));
> >>
> >> -       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> >> -                       &vmf->ptl);
> >> -       if (vmf_pte_changed(vmf)) {
> >> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> -               goto release;
> >> +       vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> >> +
> >> +       /*
> >> +        * Ensure our estimate above is still correct; we could have raced with
> >> +        * another thread to service a fault in the region.
> >> +        */
> >> +       if (order == 0) {
> >> +               if (vmf_pte_changed(vmf)) {
> >> +                       update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +                       goto release;
> >> +               }
> >> +       } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> >> +               pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> >> +
> >> +               /* If faulting pte was allocated by another, exit early. */
> >> +               if (!pte_none(ptep_get(pte))) {
> >> +                       update_mmu_tlb(vma, vmf->address, pte);
> >> +                       goto release;
> >> +               }
> >> +
> >> +               /* Else try again, with a lower order. */
> >> +               pte_unmap_unlock(vmf->pte, vmf->ptl);
> >> +               folio_put(folio);
> >> +               order--;
> >> +               goto retry;
> >
> > I'm not sure whether this extra fallback logic is worth it or not. Do
> > you have any benchmark data or is it just an arbitrary design choice?
> > If it is just an arbitrary design choice, I'd like to go with the
> > simplest way by just exiting page fault handler, just like the
> > order-0, IMHO.
>
> Yes, it's an arbitrary design choice. Based on Yu Zhao's feedback, I'm already
> reworking this so that we only try the preferred order and order-0, so no longer
> iterating through intermediate orders.
>
> I think what you are suggesting is that if attempting to allocate the preferred
> order and we find there was a race meaning that the folio now is overlapping
> populated ptes (but the faulting pte is still empty), just exit and rely on the
> page fault being re-triggered, rather than immediately falling back to order-0?

The faulting PTE might be filled too. Yes, just exit and rely on the
CPU to re-trigger the page fault.

>
> The reason I didn't do that was I wasn't sure if the return path might have
> assumptions that the faulting pte is now valid if no error was returned? I guess
> another option is to return VM_FAULT_RETRY but then it seemed cleaner to do the
> retry directly here. What do you suggest?

IIRC as long as the page fault handler doesn't return any error, it is
safe to rely on the CPU to re-trigger the page fault if the PTE is not
installed.
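
To illustrate why bailing out without an error is harmless, here is a
rough userspace model (not kernel code). All names in it
(model_pte_t, model_fault_state, model_handle_fault, model_access) are
made up for this sketch, and "races_left" is just an assumed knob that
makes the handler lose a race a given number of times. The point it
tries to show is only that the access loop ends up re-entering the
handler until the PTE is populated:

```c
#include <assert.h>

/* Userspace stand-in for a PTE: 0 means "none". Hypothetical type. */
typedef unsigned long model_pte_t;

struct model_fault_state {
	model_pte_t pte;	/* the faulting PTE */
	int races_left;		/* times the handler loses a race and bails */
	int faults;		/* how often the handler was entered */
};

/* Always returns 0 (no error), whether or not it installed the PTE. */
static int model_handle_fault(struct model_fault_state *s)
{
	s->faults++;
	if (s->races_left > 0) {
		s->races_left--;
		return 0;	/* bail out: nothing installed, no error */
	}
	s->pte = 1;		/* install a (fake) valid entry */
	return 0;
}

/* Stand-in for the CPU: the access faults again while the PTE is none. */
static int model_access(struct model_fault_state *s)
{
	assert(s);

	while (s->pte == 0) {
		int err = model_handle_fault(s);

		if (err)
			return err;
	}
	return 0;
}
```

With races_left set to 1, the handler is entered twice and the access
then completes; the caller never needs to know that the first attempt
installed nothing.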

VM_FAULT_RETRY means the page fault handler released mmap_lock (or the
per-VMA lock, when per-VMA locking is enabled) because it had to wait
for the page lock. TBH I really don't want to make that semantic more
complicated and overloaded. And I don't see any fundamental difference
between vmf_pte_changed() for an order-0 folio and overlapping PTEs
being installed for a large folio. So I'd like to follow the same
behavior.
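
A minimal sketch of what "the same behavior" could look like, again as
a userspace model rather than a proposed patch. model_check_ptes_none()
mirrors the contract of check_ptes_none() from the quoted hunk; the
other names (model_pte_t, model_install_folio) are hypothetical. The
assumption is that order-0 and large folios share one check under the
lock: if any PTE in the range is already populated, nothing is mapped
and the caller would free the folio and return without an error:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for a PTE: 0 means "none". Hypothetical type. */
typedef unsigned long model_pte_t;

/* Index of the first populated entry, or nr if all entries are none. */
static int model_check_ptes_none(const model_pte_t *pte, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (pte[i] != 0)
			return i;
	}
	return nr;
}

/*
 * Models the step done under the PTE lock after the folio is allocated.
 * Returns true if the whole range was mapped, false if a race was lost.
 * order == 0 degenerates to the single-PTE check, so there is no
 * separate path and no retry with a lower order.
 */
static bool model_install_folio(model_pte_t *pte, int order, model_pte_t entry)
{
	int nr = 1 << order;
	int i;

	assert(order >= 0 && entry != 0);

	if (model_check_ptes_none(pte, nr) != nr)
		return false;	/* caller drops the folio and returns 0 */

	for (i = 0; i < nr; i++)
		pte[i] = entry + i;
	return true;
}
```

On a lost race the existing entries are left untouched, which is the
same outcome as the vmf_pte_changed() exit for order-0 today.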

>
> Thanks,
> Ryan
>
>
>
> >
> >>         }
> >>
> >>         ret = check_stable_address_space(vma->vm_mm);
> >> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> >>                 return handle_userfault(vmf, VM_UFFD_MISSING);
> >>         }
> >>
> >> -       inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> >> -       folio_add_new_anon_rmap(folio, vma, vmf->address);
> >> +       folio_ref_add(folio, pgcount - 1);
> >> +
> >> +       add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> >> +       folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
> >>         folio_add_lru_vma(folio, vma);
> >> -setpte:
> >> +
> >>         if (uffd_wp)
> >>                 entry = pte_mkuffd_wp(entry);
> >> -       set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >> +       set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
> >>
> >>         /* No need to invalidate - it was non-present before */
> >> -       update_mmu_cache(vma, vmf->address, vmf->pte);
> >> +       update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> >>  unlock:
> >>         pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>         return ret;
> >> --
> >> 2.25.1
> >>
> >>
>