From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 27 May 2025 17:31:05 +0800
Subject: Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
To: David Hildenbrand
Cc: aarcange@redhat.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, lokeshgidra@google.com, peterx@redhat.com, ryncsn@gmail.com, surenb@google.com
References: <20250527083722.27309-1-21cnbao@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Tue, May 27, 2025 at 5:00 PM David Hildenbrand wrote:
>
> On 27.05.25 10:37, Barry Song wrote:
> > On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand wrote:
> >>>
> >>> On 23.05.25 01:23, Barry Song wrote:
> >>>> Hi All,
> >>>
> >>> Hi!
> >>>
> >>>>
> >>>> I'm encountering another bug that can be easily reproduced using the small
> >>>> program below[1], which performs swap-out and swap-in in parallel.
> >>>>
> >>>> The issue occurs when a folio is being swapped out while it is accessed
> >>>> concurrently. In this case, do_swap_page() handles the access.
> >>>> However, because the folio is under writeback, do_swap_page() completely
> >>>> removes its exclusive attribute.
> >>>>
> >>>> do_swap_page:
> >>>>         } else if (exclusive && folio_test_writeback(folio) &&
> >>>>                    data_race(si->flags & SWP_STABLE_WRITES)) {
> >>>>                 ...
> >>>>                 exclusive = false;
> >>>>
> >>>> As a result, userfaultfd_move() will return -EBUSY, even though the
> >>>> folio is not shared and is in fact exclusively owned.
> >>>>
> >>>>                 folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> >>>>                 if (!folio || !PageAnonExclusive(&folio->page)) {
> >>>>                         spin_unlock(src_ptl);
> >>>> +                       pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
> >>>> +                              __func__, __LINE__, folio,
> >>>> +                              PageAnonExclusive(&folio->page),
> >>>> +                              folio_test_swapcache(folio));
> >>>>                         err = -EBUSY;
> >>>>                         goto out;
> >>>>                 }
> >>>>
> >>>> I understand that shared folios should not be moved. However, in this
> >>>> case, the folio is not shared, yet its exclusive flag is not set.
> >>>>
> >>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of
> >>>> whether a folio is truly exclusive to a process.
> >>>
> >>> It is. The flag *not* being set is not a reliable indicator whether it
> >>> is really shared. ;)
> >>>
> >>> The reason why we have this PAE workaround (dropping the flag) in place
> >>> is because the page must not be written to (SWP_STABLE_WRITES). CoW
> >>> reuse is not possible.
> >>>
> >>> uffd moving that page -- and in that same process setting it writable,
> >>> see move_present_pte()->pte_mkwrite() -- would be very bad.
> >>
> >> An alternative approach is to make the folio writable only when we are
> >> reasonably certain it is exclusive; otherwise, it remains read-only. If the
> >> destination is later written to and the folio has become exclusive, it can
> >> be reused directly. If not, a copy-on-write will occur on the destination
> >> address, transparently to userspace.
> >> This avoids Lokesh's userspace-based
> >> strategy, which requires forcing a write to the source address.
> >
> > Conceptually, I mean something like this:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..70eaabf4f1a3 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
> >         }
> >         if (folio_test_large(src_folio) ||
> >             folio_maybe_dma_pinned(src_folio) ||
> > -           !PageAnonExclusive(&src_folio->page)) {
> > +           (!PageAnonExclusive(&src_folio->page) &&
> > +            folio_mapcount(src_folio) != 1)) {
> >                 err = -EBUSY;
> >                 goto out;
> >         }
> > @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
> >  #endif
> >         if (pte_dirty(orig_src_pte))
> >                 orig_dst_pte = pte_mkdirty(orig_dst_pte);
> > -       orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> > +       if (PageAnonExclusive(&src_folio->page))
> > +               orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> >
> >         set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
> >  out:
> > @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >         }
> >
> >         folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > -       if (!folio || !PageAnonExclusive(&folio->page)) {
> > +       if (!folio || (!PageAnonExclusive(&folio->page) &&
> > +                      folio_mapcount(folio) != 1)) {
> >                 spin_unlock(src_ptl);
> >                 err = -EBUSY;
> >                 goto out;
> >
> > I'm not trying to push this approach -- unless Lokesh clearly sees that it
> > could reduce userspace noise. I'm mainly just curious how we might make
> > the fixup transparent to userspace. :-)
>
> And that reveals the exact problem: it's all *very* complicated. :)
>
> ... and dangerous when we use the mapcount without having a complete
> understanding how it all works.
>
> What we would have to do for a small folio is:
>
> 1) Take the folio lock
>
> 2) Make sure there is only this present page table mapping:
>    folio_mapcount(folio) != 1
>
>    or better
>
>    !folio_maybe_mapped_shared(folio);
>
> 3) Make sure that there are no swap references
>
>    If in the swapcache, check the actual swapcount
>
> 4) Make sure it is not a KSM folio
>
> THPs are way, way, way more complicated to get right that way. Likely,
> the scenario described above cannot happen with a PMD-mapped THP for now
> at least (we don't have PMD swap entries).

Yeah, this can get really complicated.

> Of course, we'd then also have to handle the case when we have a swap
> pte where the marker is not set (e.g., because of swapout after the
> described swapin where we dropped the marker).
>
> What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's
> arguably less optimal in case the core will decide to swapin / CoW, but
> it leaves the magic to get all this right to the core -- and mimics the
> approach Lokesh uses.
>
> But then, maybe letting userspace just do a uffdio_copy would be even
> faster (only a single TLB shootdown?).
>
> I am also skeptical of calling this a BUG here. It's described to behave
> exactly like that [1]:
>
>        EBUSY
>               The pages in the source virtual memory range are either
>               pinned or not exclusive to the process. The kernel might
>               only perform lightweight checks for detecting whether the
>               pages are exclusive. To make the operation more likely to
>               succeed, KSM should be disabled, fork() should be avoided
>               or MADV_DONTFORK should be configured for the source
>               virtual memory area before fork().
>
> Note the "lightweight" and "more likely to succeed".

Initially, my point was that an exclusive folio (single-process case)
should be movable. Now I understand this isn't a bug, but rather a
compromise made due to implementation constraints.
Perhaps the remaining value of this report is that it helped better
understand scenarios beyond fork where a move might also fail.

I truly appreciate your time and your clear analysis.

> [1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry