From: Kairui Song <ryncsn@gmail.com>
Date: Sat, 31 May 2025 14:25:15 +0800
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Barry Song <21cnbao@gmail.com>
Cc: Lokesh Gidra, linux-mm@kvack.org, Andrew Morton, Peter Xu,
    Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
    stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>
On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra wrote:
> >
> > On Fri, May 30, 2025 at 1:17 PM Kairui Song wrote:
> > >
> > > From: Kairui Song
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and tries to move the found folio to the faulting vma.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out to be unreliable.
> > >
> > > While working on and reviewing the swap table series with Barry, the
> > > following existing race was observed and reproduced [1]:
> > >
> > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > >   swap entry PTE holding swap entry S1, and S1 isn't in the swap cache. )
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > > move_pages_pte()
> > >   entry = pte_to_swp_entry(orig_src_pte);
> > >   // Here it got entry = S1
> > >   ... < Somehow interrupted > ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is just a new allocated folio
> > >                                    // and gets installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_free_swap or the swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B into swap cache using S1>
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >                                    ...
> > >   folio = filemap_get_folio(S1)
> > >   // Got folio B here !!!
> > >   ... < Somehow interrupted again > ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry PTE
> > >                                    // holding S1 again.
> > >   folio_trylock(folio)
> > >   move_swap_pte
> > >     double_pt_lock
> > >     is_pte_pages_stable
> > >     // Check passed because src_pte == S1
> > >     folio_move_anon_rmap(...)
> > >     // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and an increased time window, it
> > > can be reproduced [1].
> >
> > Thanks for catching and fixing this. Just to clarify a few things
> > about your reproducer:
> > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or will
> > MAP_PRIVATE work as well?
> > 2. You mentioned that the 'current dir is on a block device'. Are you
> > indicating that if we are using zram for swap then it doesn't
> > reproduce?
> >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > > in the swap cache, so it needs to be moved too. In that case we should
> > > also try again so the kernel won't miss a folio move.
> > >
> > > Fix this by checking if the folio is the valid swap cache folio after
> > > acquiring the folio lock, and checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > far we don't need to worry about that, since folios might only get
> > > exposed to the swap cache in the swap out path, and that is covered in
> > > this patch too by checking the swap cache again after acquiring the
> > > src_pte lock.
> > >
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song
> > > ---
> > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..a1564d205dfb 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/mmu_notifier.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/shmem_fs.h>
> > > +#include <linux/delay.h>
> >
> > I guess you mistakenly left it from your reproducer code :)
> >
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > >                          struct folio *src_folio)
> > >  {
> > > +       swp_entry_t entry;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check again after acquiring the src_pte lock. Or we
> > > +                * might miss a newly loaded swap cache folio.
> > > +                */
> > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                             swap_cache_index(entry));
> >
> > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > will work if filemap_get_folio() was called only once after locking
> > src_ptl? Please correct me if my assumption about the overhead is wrong.
>
> Not quite sure, as we have a folio_lock(src_folio) before move_swap_pte().
> Can we safely do folio_move_anon_rmap + src_folio->index while not
> holding the folio lock?

I think no, we can't even make sure the folio is still in the swap cache,
so it could be a freed folio that does not belong to any VMA while we are
not holding the folio lock.
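
To make that last point concrete, the check that only the folio lock makes
meaningful looks roughly like the sketch below. This is kernel-internal
code, not standalone runnable, and not the actual patch: the helper name
swap_cache_folio_still_valid is made up for illustration, while
folio_test_locked(), folio_test_swapcache() and folio->swap are the
existing folio APIs/fields.

/*
 * Hypothetical helper, for illustration only: re-validate a folio found
 * by a lockless swap cache lookup. The caller must hold the folio lock;
 * without it, both checks below could become stale immediately after
 * they are made.
 */
static bool swap_cache_folio_still_valid(struct folio *folio,
                                         swp_entry_t entry)
{
        VM_WARN_ON_ONCE(!folio_test_locked(folio));

        /*
         * The folio may have been removed from the swap cache since the
         * lockless lookup (e.g. freed after a swapin); such a folio may
         * no longer belong to any VMA and must not be moved.
         */
        if (!folio_test_swapcache(folio))
                return false;

        /*
         * Even if it is still a swap cache folio, the swap entry could
         * have been freed and reused by a different folio (folio B in
         * the race diagram above), so the cached entry must match too.
         */
        return folio->swap.val == entry.val;
}

With a check like this after folio_trylock(), plus the re-lookup under the
src_pte lock shown in the quoted diff, both directions of the race should
be covered: a stale folio can no longer be moved, and a freshly added swap
cache folio is no longer missed.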