From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id EC1C0C5AE59 for ; Sat, 31 May 2025 06:54:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 454C06B0160; Sat, 31 May 2025 02:54:29 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 405D46B0163; Sat, 31 May 2025 02:54:29 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 31AAB6B0164; Sat, 31 May 2025 02:54:29 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 0D29A6B0160 for ; Sat, 31 May 2025 02:54:29 -0400 (EDT) Received: from smtpin29.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 55953161208 for ; Sat, 31 May 2025 06:54:28 +0000 (UTC) X-FDA: 83502289416.29.0C09853 Received: from mail-lj1-f174.google.com (mail-lj1-f174.google.com [209.85.208.174]) by imf19.hostedemail.com (Postfix) with ESMTP id 81EF81A000E for ; Sat, 31 May 2025 06:54:26 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=knRLJTpc; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf19.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.174 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748674466; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=7KKSGxiWUusF3NpXnt88RhyaalY1bWPLatpBLtUB9C8=; b=my/sFdnzjF8fep6V2/up1miQCyOXf2D5vaxY+1MY0KFkEYl0nRgdwr69ouZScs1OWELTbJ JbhzImtdB/hWN+z5QPBmn1a3lw2YsAI8GL2mSs6SAMU5699ESEu3EQCWUewKF71lKwINpK myG2HsLVUH8giKx/QOJ5tuUSptrkyxM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748674466; a=rsa-sha256; cv=none; b=TgthTs5MAmrY+J9obFPO5WQEAE4h9lULvFtv1pSyU2mdvUOTAudK24j2ufCaNjds3pW8vL Njx7rFcwK4NjkWUD7myhr5TDtlVIrlMEs20BuIVahXIGA8gaBLp4czI5b3PXDD5guj8X5u iL/fn+VxmxccVq4mwS17zY082kXy+vU= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=knRLJTpc; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf19.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.174 as permitted sender) smtp.mailfrom=ryncsn@gmail.com Received: by mail-lj1-f174.google.com with SMTP id 38308e7fff4ca-3292aad800aso31013681fa.0 for ; Fri, 30 May 2025 23:54:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748674465; x=1749279265; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=7KKSGxiWUusF3NpXnt88RhyaalY1bWPLatpBLtUB9C8=; b=knRLJTpcyDftmuLQRWMrOM47Kmj+N+4kxg1tJ/SwF6fhtSl4WGiUxV4GCzJBROOWST ZzkW/7gZt7q7jvSVFzTcjX/pE9LLErsBYl2oZH8hV3zjpYlTVYqim9HrUexBjOKXSs8v qCa3Rl+P5ODF3Ukw88gT6mIy9wknbdPi9vzjGaABgToLxtrsHvw8oFcObt9oopWRG+cI 1Okc/Tj9XQppzHS9L3yN+ZSH8LNk6tFtrAkwiE2EOTxYvnPrdViqKURxAPWTvf18qM42 Rx44bJ6sCJNIG+nBVR1EbndpKMnHC7IFvcib0QfqR9h76u840ksI+AIKe6gfaf/wFD6L +mvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748674465; x=1749279265; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=7KKSGxiWUusF3NpXnt88RhyaalY1bWPLatpBLtUB9C8=; b=er4cpXW2+q07ilPvLuc45GrOWu5FwTM4DsuuMBcLfFKT+yTiwwAVi471FhMyVlQ5ID lD+eu8FUmuy9mgVDzO3uuj7f9llxxmGuaMwJlWDNsvgRjpYXQkanj0Xu0yFbRHwFegw2 Srjx4odd8wpkqHxnnedRdC9b+hBxU/ltxKhCHVX7NiVyCk+U68ZM3hrrI2fdBABuxaDT J0w0v82xms3RlN0UxcXSY+kqO0dH/ebTowXboBMemPOhwfMSdb2zOR1thPygREBpJ+4k Tc3fh/gl23JLjPjs+Tu0J1LURA1fZvn9uNobo+pvJcJgufZLPq5Q6SPscQnfzsbjTXzh YVYQ== X-Gm-Message-State: AOJu0YxILFq4GSpyOMdC5Z+EfX5nIPrK9PD1SZc71ZLJzmsbEoDkpYCD HXgZQRkIXK33yquhjuxfXdoIi7oKijiXJzs+z4utsPhkyuIo0WeKl7UEZSUPoeLPCHyK/kw6xRE FppzmLiChFhugSqKbWBlsxRJxG2kkIlzgGEdV/Fw= X-Gm-Gg: ASbGncvG1OGL5O8qWRGc0BZLOx6ceLlHad8EkOXP+MCeoqNv/5ZiSy/bfSdYMhdg2NI tMfSP711mfPe//FzlS8jlKX9hYZizelGllRjxaGZcqZHMDR2sXlcLl4k+R7+wBbvPdv3CdnhyAm HCW+G6So8CCG3MoHHl7UGgGGgIHNX8H0+3 X-Google-Smtp-Source: AGHT+IEjuXhc+XS1tob6iJMFsogLwCN5+7xydvAzxLaAFbdlYQ6806qWJBCm779QeK84FiteX/3OJ/MVWxfqcx3rjbA= X-Received: by 2002:a2e:bc1e:0:b0:329:1302:a521 with SMTP id 38308e7fff4ca-32a82378ac8mr40695291fa.2.1748674464504; Fri, 30 May 2025 23:54:24 -0700 (PDT) MIME-Version: 1.0 References: <20250530201710.81365-1-ryncsn@gmail.com> In-Reply-To: From: Kairui Song Date: Sat, 31 May 2025 14:54:07 +0800 X-Gm-Features: AX0GCFs7VJ_XozhddB2_YVsREaUzj3-1R3pIid7kwPCuF7uBRdjCty7q65Y57Qo Message-ID: Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache To: Barry Song <21cnbao@gmail.com> Cc: linux-mm@kvack.org, Andrew Morton , Peter Xu , Suren Baghdasaryan , Andrea Arcangeli , David Hildenbrand , Lokesh Gidra , stable@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 81EF81A000E X-Stat-Signature: 7dt9p3pscdqe6kpwj9uooz6wm57d3mni X-Rspam-User: X-HE-Tag: 1748674466-210668 X-HE-Meta: U2FsdGVkX1+Zy1eMr9rmmSh2qVeIpiaul/GIaiM2ipTtPsk60m9nTGLmdKuxw/O5xmOhCjskCE1GE9IbsR2CducKnm1kjPcJ69XiJF3VCF6rRDDx+xNa7g+bM4k2pcKm2t3NLSg1KLRss9gkfJHHxxyodC5AdO9w0/bBWHZP5deG9rUN7198MD2RRl4JK6jHW3eQGFIdc7CreclBbFHl+XuBbbKIiQGXXEjEcP3JNC1/ro1lKEQyvqibY/5aEu34cpJyKo5yQPXntUySzjH0GQrTRYtX3GPUSqrXIDjPGgYdeb5mWsS7iALdrwyKt51vXnIGv446UQFsrn4sQxfPPPfRPOBMYB4x5aigzwxm8+3lQpNPkHKOncDRxRTh4sAEscFebMBs427H48bz6ojoqlzFwIX7Aq45no9iqLhDxtOuYM+8XK58xfIBzIFdeCyjMwh7VQrOiewY3Qph/DC0419sZAgw5gmmdi9bvACUAxlApGJgWy1mmtFWY7I2kHKiB8K04UXSq0MfNlrnNpjzG4UXyXMMf3kFRXB0rRGFNCuE4+akzxeu90tSPyL3yjmIEgy3SIj7Effeu8j5FvPiR9gzYhwPDeWxpLiOuIyR5KeWAUnl/Ra0ktopEcw87NRALIPpDUwkAsoDE2f+zfZAYhkx0g43+x6WPs8orvemhN/B4Glu44xdSFL2+2C8kzFualax8kvsyx7sMSUkRm5C4hCZv1i3mdN+9reXP38WnbnesCuDDrcAkARDSXHUR80CqP6tWhsruad7bEwdRoogIcb3/mNhtpNW0UvuWoCsihthSnPgzjzvAjqU1EXn3angS5xI8KijDBFjb8iXsZtxu/EPSQwyz0FtOUw8hVeCQG4uEWv+i6f8NgS5+x8l2WtuZAk9K7hGa6QGxm2htjaA2Ou5/O+yq58xqF/+k06KwhcwvhNSx4xN/ncYpWcJ3F7zh3nXBPVDZ9AnnaeGai2 G8MJW+LI uBLPtqRaU0FfgxRL0SD+bC/X8iA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, May 31, 2025 at 12:42=E2=80=AFPM Barry Song <21cnbao@gmail.com> wro= te: > > On Sat, May 31, 2025 at 4:04=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr= ote: > > > > On Sat, May 31, 2025 at 8:17=E2=80=AFAM Kairui Song = wrote: > > > > > > From: Kairui Song > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cac= he > > > lookup, and try to move the found folio to the faulting vma when. > > > Currently, it relies on the PTE value check to ensure the moved folio > > > still belongs to the src swap entry, which turns out is not reliable. > > > > > > While working and reviewing the swap table series with Barry, followi= ng > > > existing race is observed and reproduced [1]: > > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a > > > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache= .) > > > > > > CPU1 CPU2 > > > userfaultfd_move > > > move_pages_pte() > > > entry =3D pte_to_swp_entry(orig_src_pte); > > > // Here it got entry =3D S1 > > > ... < Somehow interrupted> ... > > > > > > // folio A is just a new allocated= folio > > > // and get installed into src_pte > > > > > > // src_pte now points to folio A, = S1 > > > // has swap count =3D=3D 0, it can= be freed > > > // by folio_swap_swap or swap > > > // allocator's reclaim. > > > > > > // folio B is a folio in another V= MA. > > > > > > // S1 is freed, folio B could use = it > > > // for swap out with no problem. > > > ... > > > folio =3D filemap_get_folio(S1) > > > // Got folio B here !!! > > > ... < Somehow interrupted again> ... > > > > > > // Now S1 is free to be used again= . > > > > > > // Now src_pte is a swap entry pte > > > // holding S1 again. > > > folio_trylock(folio) > > > move_swap_pte > > > double_pt_lock > > > is_pte_pages_stable > > > // Check passed because src_pte =3D=3D S1 > > > folio_move_anon_rmap(...) > > > // Moved invalid folio B here !!! > > > > > > The race window is very short and requires multiple collisions of > > > multiple rare events, so it's very unlikely to happen, but with a > > > deliberately constructed reproducer and increased time window, it can= be > > > reproduced [1]. > > > > > > It's also possible that folio (A) is swapped in, and swapped out agai= n > > > after the filemap_get_folio lookup, in such case folio (A) may stay i= n > > > swap cache so it needs to be moved too. In this case we should also t= ry > > > again so kernel won't miss a folio move. > > > > > > Fix this by checking if the folio is the valid swap cache folio after > > > acquiring the folio lock, and checking the swap cache again after > > > acquiring the src_pte lock. > > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so fa= r > > > we don't need to worry about that since folios only might get exposed= to > > > swap cache in the swap out path, and it's covered in this patch too b= y > > > checking the swap cache again after acquiring src_pte lock. > > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=3D6OOrK2OUZ0-tqCz= i+EJt+2_K97TPGoSt=3D9+JwP7Q@mail.gmail.com/ [1] > > > Signed-off-by: Kairui Song > > > --- > > > mm/userfaultfd.c | 26 ++++++++++++++++++++++++++ > > > 1 file changed, 26 insertions(+) > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > index bc473ad21202..a1564d205dfb 100644 > > > --- a/mm/userfaultfd.c > > > +++ b/mm/userfaultfd.c > > > @@ -15,6 +15,7 @@ > > > #include > > > #include > > > #include > > > +#include > > > #include > > > #include > > > #include "internal.h" > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, = struct vm_area_struct *dst_vma, > > > spinlock_t *dst_ptl, spinlock_t *src_ptl, > > > struct folio *src_folio) > > > { > > > + swp_entry_t entry; > > > + > > > double_pt_lock(dst_ptl, src_ptl); > > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig= _src_pte, > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm,= struct vm_area_struct *dst_vma, > > > if (src_folio) { > > > folio_move_anon_rmap(src_folio, dst_vma); > > > src_folio->index =3D linear_page_index(dst_vma, dst_a= ddr); > > > + } else { > > > + /* > > > + * Check again after acquiring the src_pte lock. Or w= e might > > > + * miss a new loaded swap cache folio. > > > + */ > > > + entry =3D pte_to_swp_entry(orig_src_pte); > > > + src_folio =3D filemap_get_folio(swap_address_space(en= try), > > > + swap_cache_index(entry)= ); > > > + if (!IS_ERR_OR_NULL(src_folio)) { > > > + double_pt_unlock(dst_ptl, src_ptl); > > > + folio_put(src_folio); > > > + return -EAGAIN; > > > + } > > > } > > > > step 1: src pte points to a swap entry without swapcache > > step 2: we call move_swap_pte() > > step 3: someone swap-in src_pte by swap_readhead() and make src_pte's s= wap entry > > have swapcache again - for non-sync/non-zRAM swap device; > > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clear* s= rc_pte; > > step 5: do_swap_page() for src_pte holds the ptl and found pte has > > been cleared in > > step 4; pte_same() returns false; > > step 6: do_swap_page() won't map src_pte to the new swapcache got from = step 3; > > if the swapcache folio is dropped, it seems everything is f= ine. Even if it's not dropped, it's fine, the folio doesn belong to any VMA in this case, and we don't need to move it. The problem is the swapin succeed before step 4, and another swapout also succeeded before step 4, so src folio will be left in the swap cache belonging to src VMA. > > > > So the real issue is that do_swap_page() doesn=E2=80=99t drop the new s= wapcache > > even when pte_same() returns false? That means the dst_pte swap-in > > can still hit the swap cache entry brought in by the src_pte's swap-in? > > It seems also possible for the sync zRAM device. > > step 1: src pte points to a swap entry S without swapcache > step 2: we call move_swap_pte() > step 3: someone swap-in src_pte by sync path, no swapcache; swap slot > S is freed. > -- for zRAM; > step 4: someone swap-out src_pte, get the exactly same swap slot S as st= ep 1, > adds folio to swapcache due to swapout; > step 5: move_swap_pte() gets ptl and finds page tables are stable > since swap-out > happens to have the same swap slot as step1; > step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the = folio. Yes, that's exactly the case. > > Yep, we really need to re-check pte for swapcache after holding PTL. > > > > > > > > > orig_src_pte =3D ptep_get_and_clear(mm, src_addr, src_pte); > > > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm= , pmd_t *dst_pmd, pmd_t *src_pmd, > > > folio_lock(src_folio); > > > goto retry; > > > } > > > + /* > > > + * Check if the folio still belongs to the ta= rget swap entry after > > > + * acquiring the lock. Folio can be freed in = the swap cache while > > > + * not locked. > > > + */ > > > + if (unlikely(!folio_test_swapcache(folio) || > > > + entry.val !=3D folio->swap.val))= { > > > + err =3D -EAGAIN; > > > + goto out; > > > + } > > > } > > > err =3D move_swap_pte(mm, dst_vma, dst_addr, src_addr= , dst_pte, src_pte, > > > orig_dst_pte, orig_src_pte, dst_pmd, = dst_pmdval, > > > -- > > > 2.49.0 > > > > > > > Thanks > Barry >