From: Kairui Song <ryncsn@gmail.com>
Date: Mon, 2 Jun 2025 13:47:24 +0800
Subject: Re: [PATCH v2] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Lokesh Gidra
Cc: linux-mm@kvack.org, Andrew Morton, Barry Song <21cnbao@gmail.com>,
    Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
    stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250601200108.23186-1-ryncsn@gmail.com>

On Mon, Jun 2, 2025 at 9:43 AM Lokesh Gidra wrote:
>
> On Sun, Jun 1, 2025 at 1:01 PM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out to be unreliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing race was observed and reproduced [1]:
> >
> > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> >   swap entry PTE holding swap entry S1, and S1 isn't in the swap cache. )
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >   ... < Somehow interrupted > ...
> >                                    // folio A is just a newly allocated
> >                                    // folio and gets installed into src_pte
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    // folio B is a folio in another VMA.
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >                                    ...
> >   folio = filemap_get_folio(S1)
> >   // Got folio B here !!!
> >   ... < Somehow interrupted again > ...
> >                                    // Now S1 is free to be used again.
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >   folio_trylock(folio)
> >   move_swap_pte
> >     double_pt_lock
> >     is_pte_pages_stable
> >     // Check passed because src_pte == S1
> >     folio_move_anon_rmap(...)
> >     // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and an increased time window, it
> > can be reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup; in that case folio (A) may stay in
> > the swap cache, so it needs to be moved too. In this case we should
> > also retry so the kernel won't miss a folio move.
> >
> > Fix this by checking that the folio is still the valid swap cache folio
> > after acquiring the folio lock, and by checking the swap cache again
> > after acquiring the src_pte lock.
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about it, since folios might only get
> > exposed to the swap cache in the swap-out path, and that is covered by
> > this patch too by checking the swap cache again after acquiring the
> > src_pte lock.
> >
> > Testing with a simple C program that allocates and moves several GB of
> > memory did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> >  mm/userfaultfd.c | 27 +++++++++++++++++++++++++--
> >  1 file changed, 25 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a74ede04996c 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,11 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                           pte_t orig_dst_pte, pte_t orig_src_pte,
> >                           pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                         struct folio *src_folio)
> > +                         struct folio *src_folio,
> > +                         struct swap_info_struct *si)
> >  {
> > +        swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,16 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check if the swap entry is cached after acquiring the src_pte
> > +                * lock. Or we might miss a newly loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
>
> Can we pass this also from move_pages_pte()? It would be great to
> minimize the PTL critical section.

I checked the objdump output. It seems the compiler is doing a good job
of optimizing all the overhead away since it's an inlined macro, but I
can pass it in too, just in case.

> > +               if (si->swap_map[swp_offset(entry)] & SWAP_HAS_CACHE) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       return -EAGAIN;
> > +               }
> >         }
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,10 +1422,20 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target swap entry after
> > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > +                        * not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
>
> This check will get skipped if the folio was locked by folio_lock()
> rather than folio_trylock(). It seems to me that the correct way to do
> this is to check outside this if block (right before calling
> move_swap_pte()) and with 'src_folio' instead of 'folio'. Even better,
> if you pass 'entry' as well to move_swap_pte() as per my previous
> comment, then you can move this check into move_swap_pte() as well.

Good catch, let me move it into move_swap_pte() then; it's much easier
to follow that way.
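
[ Editorial note: below is a minimal sketch of the kind of "simple C
program to allocate and move several GB of memory" mentioned in the
patch description. It is not the author's actual test program, only an
illustration of the UFFDIO_MOVE usage pattern, and it assumes kernel
headers (6.8 or later) that provide UFFDIO_MOVE and UFFD_FEATURE_MOVE. ]

/*
 * Sketch: allocate a large anonymous region, then move it page by page
 * into a userfaultfd-registered destination with UFFDIO_MOVE.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SIZE (1UL << 30)	/* 1 GiB per pass; loop or enlarge for a multi-GB run */

int main(void)
{
	unsigned long page = sysconf(_SC_PAGESIZE);
	/* UFFD_USER_MODE_ONLY lets unprivileged users create the fd */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
	if (uffd < 0) { perror("userfaultfd"); return 1; }

	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };
	if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

	char *src = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED) { perror("mmap"); return 1; }

	/* Fault in and dirty the source so there are anonymous folios to move */
	memset(src, 0x5a, SIZE);

	/* Only the destination range must be registered with this uffd context */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = SIZE },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

	for (unsigned long off = 0; off < SIZE; off += page) {
		struct uffdio_move mv = {
			.dst = (unsigned long)dst + off,
			.src = (unsigned long)src + off,
			.len = page,
			.mode = 0,
		};
		/*
		 * Each call goes through move_pages_pte(); the swap path
		 * (move_swap_pte) only triggers if the source page has been
		 * swapped out in the meantime.
		 */
		if (ioctl(uffd, UFFDIO_MOVE, &mv) || mv.move != (long long)page) {
			fprintf(stderr, "UFFDIO_MOVE failed at offset %#lx\n", off);
			return 1;
		}
	}
	close(uffd);
	return 0;
}

To actually exercise the swap-entry path discussed in the thread, the
source pages would additionally have to be pushed to swap (for example
under memory pressure or via madvise(MADV_PAGEOUT)) between the memset
and the moves.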