From mboxrd@z Thu Jan  1 00:00:00 1970
From: Lokesh Gidra <lokeshgidra@google.com>
Date: Fri, 30 May 2025 16:40:10 -0700
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Barry Song <21cnbao@gmail.com>, Peter Xu <peterx@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	David Hildenbrand <david@redhat.com>,
	stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20250530201710.81365-1-ryncsn@gmail.com>
References: <20250530201710.81365-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Fri, May 30, 2025 at 1:17 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out to be unreliable.
>
> While working on and reviewing the swap table series with Barry, the
> following existing race was observed and reproduced [1]:
>
> (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is a newly allocated folio
>                                    // and gets installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_free_swap or the swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B into swap cache using S1>
>                                    // S1 is freed, folio B can use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry PTE
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and an increased time window, it
> can be reproduced [1].

Thanks for catching and fixing this. Just to clarify a few things about
your reproducer:

1. Is it necessary for the 'race' mapping to be MAP_SHARED, or will
   MAP_PRIVATE work as well?

2. You mentioned that the 'current dir is on a block device'. Are you
   indicating that it doesn't reproduce if we use zram for swap?
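To make the questions concrete, this is the rough shape I'm assuming
your reproducer has (a hand-wavy sketch with made-up names and sizes,
not your actual code; the UFFDIO_API handshake, UFFDIO_REGISTER of dst,
and all error handling are elided):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define LEN 4096UL

int main(void)
{
	/* Question 1: must this be MAP_SHARED, or does MAP_PRIVATE
	 * race the same way? Question 2: does the backing file have
	 * to live on a block device? */
	int fd = open("race", O_RDWR | O_CREAT, 0600);
	ftruncate(fd, LEN);
	char *race = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	race[0] = 1;	/* I assume this mapping drives the swap churn */

	/* Anonymous src/dst areas for the move itself. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	char *src = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	src[0] = 1;	/* populate src, then let it get swapped out */

	struct uffdio_move move = {
		.src = (unsigned long)src,
		.dst = (unsigned long)dst,
		.len = LEN,
		.mode = 0,
	};
	/* Racing this against concurrent swapout/swapin of src is
	 * where move_pages_pte() can pick up the wrong folio. */
	ioctl(uffd, UFFDIO_MOVE, &move);
	return 0;
}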
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup; in such a case folio (A) may stay
> in the swap cache, so it needs to be moved too. In this case we should
> also retry, so the kernel won't miss a folio move.
>
> Fix this by checking that the folio is the valid swap cache folio
> after acquiring the folio lock, and by checking the swap cache again
> after acquiring the src_pte lock.
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about it, since folios can only get exposed
> to the swap cache in the swap-out path, and that is covered by this
> patch too, by checking the swap cache again after acquiring the
> src_pte lock.
>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a1564d205dfb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
>  #include <linux/shmem_fs.h>
> +#include <linux/delay.h>

I guess you mistakenly left this in from your reproducer code :)

>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
>                           struct folio *src_folio)
>  {
> +       swp_entry_t entry;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check again after acquiring the src_pte lock. Or we
> +                * might miss a newly loaded swap cache folio.
> +                */
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               src_folio = filemap_get_folio(swap_address_space(entry),
> +                                             swap_cache_index(entry));

Given the non-trivial overhead of filemap_get_folio(), do you think it
would work to call filemap_get_folio() only once, after locking
src_ptl? Please correct me if my assumption about the overhead is
wrong.

> +               if (!IS_ERR_OR_NULL(src_folio)) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       folio_put(src_folio);
> +                       return -EAGAIN;
> +               }
>         }
>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                                 folio_lock(src_folio);
>                                 goto retry;
>                         }
> +                       /*
> +                        * Check if the folio still belongs to the target
> +                        * swap entry after acquiring the lock. The folio
> +                        * can be freed from the swap cache while not
> +                        * locked.
> +                        */
> +                       if (unlikely(!folio_test_swapcache(folio) ||
> +                                    entry.val != folio->swap.val)) {
> +                               err = -EAGAIN;
> +                               goto out;
> +                       }

To avoid further increasing move_pages_pte() size, I recommend moving
the entire 'pte not present' case into move_swap_pte(), and maybe
returning some positive integer (or something more appropriate) to
handle the retry case. And then in move_swap_pte(), as suggested above,
you can do the filemap_get_folio() lookup only once, after locking the
ptl. I think this will fix the bug as well as improve the code's
organization. (A rough sketch of what I mean follows the quoted patch
below.)

>                 }
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> --
> 2.49.0
>
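Concretely, something like this completely untested sketch is what I
have in mind (the point is the single swap cache lookup under the PTL;
the exact retry plumbing between move_pages_pte() and move_swap_pte()
is just a suggested convention):

static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
			 unsigned long dst_addr, unsigned long src_addr,
			 pte_t *dst_pte, pte_t *src_pte,
			 pte_t orig_dst_pte, pte_t orig_src_pte,
			 pmd_t *dst_pmd, pmd_t dst_pmdval,
			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
			 struct folio *src_folio)
{
	struct folio *folio;
	swp_entry_t entry;

	double_pt_lock(dst_ptl, src_ptl);

	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
				 dst_pmd, dst_pmdval)) {
		double_pt_unlock(dst_ptl, src_ptl);
		return -EAGAIN;
	}

	if (!src_folio) {
		/*
		 * Probe the swap cache exactly once, now that the PTEs
		 * are known to be stable; anything found here means a
		 * concurrent swapin/swapout beat us, so back off.
		 */
		entry = pte_to_swp_entry(orig_src_pte);
		folio = filemap_get_folio(swap_address_space(entry),
					  swap_cache_index(entry));
		if (!IS_ERR_OR_NULL(folio)) {
			double_pt_unlock(dst_ptl, src_ptl);
			folio_put(folio);
			/*
			 * Could instead return a positive code meaning
			 * "lock the folio and retry", once the whole
			 * 'pte not present' case lives in this function.
			 */
			return -EAGAIN;
		}
	}

	/* ... rest of the function unchanged: move the PTE (and the
	 * folio's anon rmap, if src_folio is set) to dst ... */
	return 0;
}

That keeps move_pages_pte() at its current size and does at most one
swap cache lookup per attempt.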