From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 3 Jun 2025 19:48:49 +0800
Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Barry Song <21cnbao@gmail.com>
Cc: Peter Xu, linux-mm@kvack.org, Andrew Morton, Suren Baghdasaryan,
 Andrea Arcangeli, David Hildenbrand, Lokesh Gidra,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Jun 3, 2025 at 6:08 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Jun 3, 2025 at 8:34 AM Peter Xu wrote:
> >
> > On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > > From: Kairui Song <ryncsn@gmail.com>
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and tries to move the found folio to the faulting vma.
> > > Currently, it relies on a PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out not to be reliable.
> > >
> > > While working on and reviewing the swap table series with Barry, the
> > > following existing race was observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >   swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a newly allocated folio
> > >                                    // and gets installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_free_swap or swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B to swap cache using S1>
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry PTE
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and an increased time window, it
> > > can be reproduced [1].
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > > in the swap cache, so it needs to be moved too. In this case we should
> > > also try again so the kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > far we don't need to worry about that, since folios might only get
> > > exposed to the swap cache in the swap out path, and it's covered in
> > > this patch too by checking the swap cache again after acquiring the
> > > src_pte lock.
> >
> > [1]
> > >
> > > Testing with a simple C program to allocate and move several GB of memory
> > > did not show any observable performance change.
> > >
> > > Cc: stable@vger.kernel.org
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song <ryncsn@gmail.com>
> > >
> > > ---
> > >
> > > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > > Changes:
> > > - Check swap_map instead of doing a filemap lookup after acquiring the
> > >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> > >
> > > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > > Changes:
> > > - Move the folio and swap check inside move_swap_pte to avoid skipping
> > >   the check and potential overhead [ Lokesh Gidra ]
> > > - Add a READ_ONCE for the swap_map read to ensure it reads an
> > >   up-to-date value.
> > >
> > >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..5dc05346e360 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          pte_t orig_dst_pte, pte_t orig_src_pte,
> > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > -                        struct folio *src_folio)
> > > +                        struct folio *src_folio,
> > > +                        struct swap_info_struct *si, swp_entry_t entry)
> > >  {
> > > +       /*
> > > +        * Check if the folio still belongs to the target swap entry after
> > > +        * acquiring the lock. Folio can be freed in the swap cache while
> > > +        * not locked.
> > > +        */
> > > +       if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > > +                                 entry.val != src_folio->swap.val))
> > > +               return -EAGAIN;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check if the swap entry is cached after acquiring the src_pte
> > > +                * lock. Or we might miss a newly loaded swap cache folio.
> > > +                */
> > > +               if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> >
> > Do we need data_race() for this, if this is an intentionally lockless read?
>
> Not entirely sure. But I recommend this pattern, borrowed from
> zap_nonpresent_ptes() -> free_swap_and_cache_nr(),
> where the PTL is also held and READ_ONCE is used.
>
>         if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
>                 ..
>                 nr = __try_to_reclaim_swap(si, offset,
>                                            TTRS_UNMAPPED | TTRS_FULL);
>                 if (nr == 0)
>                         nr = 1;
>                 else if (nr < 0)
>                         nr = -nr;
>                 nr = ALIGN(offset + 1, nr) - offset;
>         }

Thanks for the explanation, I also agree that holding the PTL here is
good enough.
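
A side note for anyone following the data_race() question: READ_ONCE()
provides no locking here. For a scalar, the kernel macro boils down to a
single volatile load of matching size, which only stops the compiler from
fusing, tearing, hoisting, or repeating the access. A rough userspace
sketch of that semantics (the 0x40 value matches the kernel's
SWAP_HAS_CACHE; everything else is illustrative):

        #define READ_ONCE(x)   (*(const volatile __typeof__(x) *)&(x))
        #define SWAP_HAS_CACHE 0x40    /* same value as the kernel flag */

        static unsigned char swap_map_entry; /* stand-in for si->swap_map[offset] */

        static int entry_has_cache(void)
        {
                /*
                 * The volatile access forces exactly one byte-sized load
                 * to be emitted right here; a plain read could legally be
                 * fused with an earlier load or performed twice.
                 */
                return READ_ONCE(swap_map_entry) & SWAP_HAS_CACHE;
        }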
>
> I think we could use this to further optimize the existing
> filemap_get_folio(), since in the vast majority of cases we don't
> have a swapcache, yet we still always call filemap_get_folio().
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..c527ec73c3b4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
>
> @@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                  * folios in the swapcache. This issue needs to be resolved
>                  * separately to allow proper handling.
>                  */
> -               if (!src_folio)
> +               if (!src_folio && (swap_map[offset] & SWAP_HAS_CACHE))
>                         folio = filemap_get_folio(swap_address_space(entry),
>                                                   swap_cache_index(entry));
>                 if (!IS_ERR_OR_NULL(folio)) {
>
> To be future-proof, we may want to keep the READ_ONCE to ensure
> the compiler doesn't skip the second read inside move_swap_pte().

Maybe we can do this optimization in another patch, I think.

> >
> > Another pure swap question: the comment seems to imply this whole thing
> > is protected by the src_pte lock, but is it?
> >
> > I'm not familiar enough with swap code, but it looks to me like the
> > folio can be added into the swap cache, and swap_map[] set with
> > SWAP_HAS_CACHE, as long as the folio is locked. It doesn't seem to be
> > directly protected by the pgtable lock.
> >
> > Perhaps you meant this: since the src_pte lock is held, it'll serialize
> > with another thread B concurrently swapping in the swap entry, but only
> > _later_, when thread B's do_swap_page() checks again with pte_same() and
> > sees the src pte gone (after thread A's uffdio_move happened, releasing
> > the src_pte lock), hence thread B will release the newly allocated swap
> > cache folio?
> >
> > There's another trivial detail: IIUC pte_same() must fail because
> > before/after the uffdio_move the swap entry will be occupied, so there
> > is no way for it to be reused; hence src_pte, even if re-populated after
> > uffdio_move succeeded, cannot become the orig_pte (pointing to the swap
> > entry in question) that thread B read, so the pte_same() check must fail.
>
> In v1 of this patch, we had some similar discussions [1][2]:
>
> [1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/
>
> At the very least, [2] is possible, although the probability is extremely low.
>
> "It seems also possible for the sync zRAM device.
>
> step 1: src pte points to a swap entry S without swapcache
> step 2: we call move_swap_pte()
> step 3: someone swaps in src_pte by the sync path, no swapcache; swap
>         slot S is freed. -- for zRAM;
> step 4: someone swaps out src_pte, getting exactly the same swap slot S
>         as step 1, and adds the folio to the swapcache due to swapout;
> step 5: move_swap_pte() gets the PTL and finds the page tables are
>         stable, since the swap-out happens to use the same swap slot as
>         step 1;
> step 6: we clear src_pte and move src_pte to dst_pte, but miss moving
>         the folio.
>
> Yep, we really need to re-check the pte for swapcache after holding the PTL.
> "
>
> Personally, I agree that improving the changelog or the comments
> would be more helpful. In fact, there are two bugs here, and Kairui's
> changelog clearly describes the first one.

Yeah, the first one is quite a long and complex race already, so I made
the description of the second issue shorter. I thought it wouldn't be
too difficult to understand given the first example.
I can add some more details.
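
For the changelog, the shape of the fix might be easier to see stripped
of the mm details. Here is a minimal sketch of the pattern both bugs
need, with all names illustrative and a plain mutex standing in for the
folio lock and PTL: the lockless lookup is only a hint, so the
key-to-object binding must be revalidated after each lock acquisition,
and the caller retries on mismatch, just like the -EAGAIN path above.

        #include <pthread.h>
        #include <stddef.h>

        struct obj {
                pthread_mutex_t lock;   /* zero-init is fine on glibc; portable
                                         * code would call pthread_mutex_init() */
                unsigned long key;      /* analogous to folio->swap.val */
        };

        static struct obj table[16];    /* stand-in for the swap cache */

        static struct obj *lookup(unsigned long key)
        {
                struct obj *o = &table[key % 16];

                /* Racy by design: o->key may change until o->lock is held. */
                return o->key == key ? o : NULL;
        }

        /* Return the object locked and still bound to @key, or NULL to retry. */
        static struct obj *get_stable(unsigned long key)
        {
                struct obj *o = lookup(key);    /* like filemap_get_folio(S1) */

                if (!o)
                        return NULL;
                pthread_mutex_lock(&o->lock);   /* like folio_lock() */
                if (o->key != key) {            /* revalidate under the lock */
                        pthread_mutex_unlock(&o->lock);
                        return NULL;            /* caller retries, like -EAGAIN */
                }
                return o;
        }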
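For context on what userspace is doing while all this races underneath,
here is a heavily abridged sketch of driving UFFDIO_MOVE. This is not
the reproducer from [1] — it has no fault-handler thread and no memory
pressure to open the race window — and it assumes a kernel with
UFFD_FEATURE_MOVE (6.8+):

        #include <fcntl.h>
        #include <linux/userfaultfd.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 2UL << 20; /* 2 MiB; the perf test moved several GB */
                long uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
                struct uffdio_api api = {
                        .api = UFFD_API,
                        .features = UFFD_FEATURE_MOVE,  /* request MOVE support */
                };

                if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                        return 1;

                char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (src == MAP_FAILED || dst == MAP_FAILED)
                        return 1;

                /* The destination range must be registered with the uffd. */
                struct uffdio_register reg = {
                        .range = { .start = (unsigned long)dst, .len = len },
                        .mode = UFFDIO_REGISTER_MODE_MISSING,
                };
                if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                        return 1;

                /* Fault in src; under memory pressure these pages are the
                 * ones that get swapped out and in behind the move. */
                memset(src, 1, len);

                struct uffdio_move mv = {
                        .dst = (unsigned long)dst,
                        .src = (unsigned long)src,
                        .len = len,
                };
                if (ioctl(uffd, UFFDIO_MOVE, &mv) || (size_t)mv.move != len) {
                        fprintf(stderr, "move: %lld\n", (long long)mv.move);
                        return 1;
                }
                return dst[0] == 1 ? 0 : 1;     /* data followed the move */
        }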