From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 16:04:24 +1200
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Peter Xu, Suren Baghdasaryan,
 Andrea Arcangeli, David Hildenbrand, Lokesh Gidra,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20250530201710.81365-1-ryncsn@gmail.com>
References: <20250530201710.81365-1-ryncsn@gmail.com>
On Sat, May 31, 2025 at 8:17 AM Kairui Song wrote:
>
> From: Kairui Song
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out to be unreliable.
>
> While working on and reviewing the swap table series with Barry, the
> following existing race was observed and reproduced [1]:
>
> (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>  move_pages_pte()
>   entry = pte_to_swp_entry(orig_src_pte);
>   // Here it got entry = S1
>   ... < Somehow interrupted> ...
>                                    <swapin src_pte, using folio A>
>                                    // folio A is a newly allocated folio
>                                    // and gets installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_free_swap or the swap
>                                    // allocator's reclaim.
>                                    <swapout folio B using S1>
>                                    // folio B is a folio in another VMA.
>                                    // S1 is freed, folio B can use it
>                                    // for swap out with no problem.
>                                    ...
>   folio = filemap_get_folio(S1)
>   // Got folio B here !!!
>   ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>   folio_trylock(folio)
>   move_swap_pte
>    double_pt_lock
>    is_pte_pages_stable
>    // Check passed because src_pte == S1
>    folio_move_anon_rmap(...)
>    // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and an increased time window, it
> can be reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup; in such a case folio (A) may stay
> in the swap cache, so it needs to be moved too. In this case we should
> also retry so the kernel won't miss a folio move.
>
> Fix this by checking that the folio is the valid swap cache folio
> after acquiring the folio lock, and checking the swap cache again
> after acquiring the src_pte lock.
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about that, since folios can only get
> exposed to the swap cache in the swap-out path, and that is covered in
> this patch too by checking the swap cache again after acquiring the
> src_pte lock.
>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song
> ---
>  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a1564d205dfb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
> +#include "swap.h"
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
>  			 struct folio *src_folio)
>  {
> +	swp_entry_t entry;
> +
>  	double_pt_lock(dst_ptl, src_ptl);
>
>  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  	if (src_folio) {
>  		folio_move_anon_rmap(src_folio, dst_vma);
>  		src_folio->index = linear_page_index(dst_vma, dst_addr);
> +	} else {
> +		/*
> +		 * Check again after acquiring the src_pte lock. Or we might
> +		 * miss a newly loaded swap cache folio.
> +		 */
> +		entry = pte_to_swp_entry(orig_src_pte);
> +		src_folio = filemap_get_folio(swap_address_space(entry),
> +					      swap_cache_index(entry));
> +		if (!IS_ERR_OR_NULL(src_folio)) {
> +			double_pt_unlock(dst_ptl, src_ptl);
> +			folio_put(src_folio);
> +			return -EAGAIN;
> +		}
>  	}

step 1: src_pte points to a swap entry without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte via swap_readahead(), so src_pte's swap
entry has swapcache again - for a non-sync/non-zRAM swap device;
step 4: move_swap_pte() gets the ptl, moves src_pte to dst_pte and
*clears* src_pte;
step 5: do_swap_page() for src_pte holds the ptl and finds the pte has
been cleared in step 4; pte_same() returns false;
step 6: do_swap_page() won't map src_pte to the new swapcache from step
3; if the swapcache folio is dropped, it seems everything is fine.

So the real issue is that do_swap_page() doesn't drop the new swapcache
even when pte_same() returns false? That means the dst_pte swap-in can
still hit the swap cache entry brought in by the src_pte's swap-in?

>
>  	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  			folio_lock(src_folio);
>  			goto retry;
>  		}
> +		/*
> +		 * Check if the folio still belongs to the target swap entry after
> +		 * acquiring the lock. Folio can be freed in the swap cache while
> +		 * not locked.
> +		 */
> +		if (unlikely(!folio_test_swapcache(folio) ||
> +			     entry.val != folio->swap.val)) {
> +			err = -EAGAIN;
> +			goto out;
> +		}
>  	}
>  	err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>  			    orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> --
> 2.49.0
>

Thanks
Barry