From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 3 Jun 2025 10:08:09 +1200
Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and
 swap cache
To: Peter Xu
Cc: Kairui Song, linux-mm@kvack.org, Andrew Morton, Suren Baghdasaryan,
 Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <20250602181419.20478-1-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Tue, Jun 3, 2025 at 8:34 AM Peter Xu wrote:
>
> On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > From: Kairui Song <ryncsn@gmail.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out to be unreliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing race was observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache. )
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted > ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a newly allocated
> >                                    // folio and gets installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1>
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again > ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry PTE
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can be
> > reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup; in such a case folio (A) may stay in
> > the swap cache, so it needs to be moved too. In this case we should also
> > retry so the kernel won't miss a folio move.
> >
> > Fix this by checking whether the folio is still the valid swap cache
> > folio after acquiring the folio lock, and by checking the swap cache
> > again after acquiring the src_pte lock.
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about it, since folios can only get exposed
> > to the swap cache in the swap-out path, and that is covered by this
> > patch too, by checking the swap cache again after acquiring the src_pte
> > lock.
> >
> > [1]
> >
> > Testing with a simple C program to allocate and move several GB of memory
> > did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song <ryncsn@gmail.com>
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> >   the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads an up-to-date
> >   value.
> >
> >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..5dc05346e360 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                          pte_t orig_dst_pte, pte_t orig_src_pte,
> >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                        struct folio *src_folio)
> > +                        struct folio *src_folio,
> > +                        struct swap_info_struct *si, swp_entry_t entry)
> >  {
> > +       /*
> > +        * Check if the folio still belongs to the target swap entry after
> > +        * acquiring the lock. The folio can be freed from the swap cache
> > +        * while not locked.
> > +        */
> > +       if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > +                                 entry.val != src_folio->swap.val))
> > +               return -EAGAIN;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check if the swap entry is cached after acquiring the
> > +                * src_pte lock. Otherwise we might miss a newly loaded
> > +                * swap cache folio.
> > +                */
> > +               if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
>
> Do we need data_race() for this, if this is an intentionally lockless read?

Not entirely sure. But I recommend this pattern, borrowed from
zap_nonpresent_ptes() -> free_swap_and_cache_nr(), where the PTL is
also held and READ_ONCE is used:

        if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
                ..
                nr = __try_to_reclaim_swap(si, offset,
                                           TTRS_UNMAPPED | TTRS_FULL);
                if (nr == 0)
                        nr = 1;
                else if (nr < 0)
                        nr = -nr;
                nr = ALIGN(offset + 1, nr) - offset;
        }

I think we could use this to further optimize the existing
filemap_get_folio(), since in the vast majority of cases we don't have
a swapcache, yet we still always call filemap_get_folio().

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..c527ec73c3b4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
                 * folios in the swapcache. This issue needs to be resolved
                 * separately to allow proper handling.
                 */
-               if (!src_folio)
+               if (!src_folio && (si->swap_map[swp_offset(entry)] & SWAP_HAS_CACHE))
                        folio = filemap_get_folio(swap_address_space(entry),
                                                  swap_cache_index(entry));
                if (!IS_ERR_OR_NULL(folio)) {

To be future-proof, we may want to keep the READ_ONCE to ensure the
compiler doesn't skip the second read inside move_swap_pte().

> Another pure swap question: the comment seems to imply this whole thing is
> protected by the src_pte lock, but is it?
>
> I'm not familiar enough with swap code, but it looks to me like the folio
> can be added into the swap cache, with SWAP_HAS_CACHE set in swap_map[],
> as long as the folio is locked. It doesn't seem to be directly protected
> by the pgtable lock.
>
> Perhaps you meant this: since the src_pte lock is held, it'll serialize
> with another thread B concurrently swapping in the swap entry, but only
> _later_, when thread B's do_swap_page() checks again with pte_same(): it'll
> see the src pte gone (after thread A's uffdio_move happened, releasing the
> src_pte lock), hence thread B will release the newly allocated swap cache
> folio?
>
> There's another trivial detail: IIUC, pte_same() must fail because
> before/after the uffdio_move the swap entry will be occupied, so there is
> no way for it to be reused; hence src_pte, even if re-populated again after
> uffdio_move succeeded, cannot become the orig_pte (pointing to the swap
> entry in question) that thread B read, hence the pte_same() check must fail.

In v1 of this patch, we had some similar discussions [1][2]:

[1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/
[2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/

At the very least, [2] is possible, although the probability is
extremely low:

"It seems also possible for the sync zRAM device.

step 1: src pte points to a swap entry S without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte by the sync path, no swapcache; swap
        slot S is freed. -- for zRAM;
step 4: someone swaps out src_pte and gets exactly the same swap slot
        S as in step 1, adding the folio to the swapcache due to the
        swapout;
step 5: move_swap_pte() gets the PTL and finds the page tables are
        stable, since the swap-out happened to reuse the same swap
        slot as in step 1;
step 6: we clear src_pte and move src_pte to dst_pte, but miss moving
        the folio.

Yep, we really need to re-check the pte for swapcache after holding
the PTL."

Personally, I agree that improving the changelog or the comments would
be more helpful. In fact, there are two bugs here, and Kairui's
changelog clearly describes the first one.

> I'm not sure my understanding is correct, though. Maybe some richer
> comment would always help.
>
> Thanks,
>
> > +                     double_pt_unlock(dst_ptl, src_ptl);
> > +                     return -EAGAIN;
> > +             }
> >         }
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > -                                   dst_ptl, src_ptl, src_folio);
> > +                                   dst_ptl, src_ptl, src_folio, si, entry);
> >         }
> >
> > out:
> > --
> > 2.49.0
> >
> >
>
> --
> Peter Xu
>
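The fix and the step 1-6 walkthrough above both reduce to the same
idiom: do a lockless lookup first, then revalidate the identity of
what you found after taking the lock, and retry on mismatch. Here is a
minimal userspace C sketch of that idiom, purely as an illustration;
slot, gen and payload are hypothetical stand-ins for the swap entry,
the identity check (folio_test_swapcache plus folio->swap in the real
patch) and the folio, not kernel interfaces:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct slot {
        pthread_mutex_t lock;
        atomic_ulong gen;       /* bumped whenever the slot is reused */
        const char *payload;    /* protected by lock */
};

/*
 * Lockless snapshot first, then revalidate under the lock: the
 * userspace analogue of looking the folio up without the PTL and
 * re-checking the swap cache once the PTL is held. The kernel patch
 * returns -EAGAIN instead of looping internally.
 */
static const char *take_payload(struct slot *s)
{
        for (;;) {
                unsigned long snap = atomic_load(&s->gen);

                /* window where another thread may free and reuse the slot */

                pthread_mutex_lock(&s->lock);
                if (atomic_load(&s->gen) == snap) {
                        /* identity confirmed under the lock */
                        const char *p = s->payload;

                        s->payload = NULL;
                        pthread_mutex_unlock(&s->lock);
                        return p;
                }
                /* slot was reused in the window: retry, like -EAGAIN */
                pthread_mutex_unlock(&s->lock);
        }
}

int main(void)
{
        struct slot s = {
                .lock = PTHREAD_MUTEX_INITIALIZER,
                .payload = "moved",
        };

        printf("%s\n", take_payload(&s));
        return 0;
}

The same shape appears twice in the patch: once under the folio lock
(the swap cache identity re-check) and once under the PTL (the
swap_map re-check).

Thanks
barry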