From: Barry Song <21cnbao@gmail.com>
Date: Mon, 2 Jun 2025 07:02:05 +0800
Subject: Re: [PATCH v2] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20250601200108.23186-1-ryncsn@gmail.com>
References: <20250601200108.23186-1-ryncsn@gmail.com>
On Mon, Jun 2, 2025 at 4:01 AM Kairui Song wrote:
>
> From: Kairui Song
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on the PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out to be unreliable.
>
> While working on and reviewing the swap table series with Barry, the
> following existing race was observed and reproduced [1]:
>
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> swap entry PTE holding swap entry S1, and S1 isn't in the swap cache. )
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted > ...
>                                    <swap in src_pte, installing folio A>
>                                    // folio A is just a new allocated folio
>                                    // and gets installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_free_swap or the swap
>                                    // allocator's reclaim.
>                                    <swap out folio B using S1>
>                                    // folio B is a folio in another VMA.
>                                    // S1 is freed, folio B could use it
>                                    // for swap out with no problem.
>     ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again > ...
>                                    <swap in folio B, freeing S1>
>                                    // Now S1 is free to be used again.
>                                    <swap out src_pte using S1>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and an increased time window, it can
> be reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup; in that case folio (A) may stay in
> the swap cache, so it needs to be moved too. Here we should also retry,
> so the kernel won't miss a folio move.
>
> Fix this by checking that the folio is the valid swap cache folio after
> acquiring the folio lock, and by checking the swap cache again after
> acquiring the src_pte lock.
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about it, since folios might only get exposed
> to the swap cache in the swap out path, and that is covered by this
> patch too, by checking the swap cache again after acquiring the src_pte
> lock.
>
> Testing with a simple C program to allocate and move several GB of memory
> did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song

Reviewed-by: Barry Song <21cnbao@gmail.com>

> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
>  mm/userfaultfd.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a74ede04996c 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,11 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                          pte_t orig_dst_pte, pte_t orig_src_pte,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -                        struct folio *src_folio)
> +                        struct folio *src_folio,
> +                        struct swap_info_struct *si)
>  {
> +       swp_entry_t entry;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,16 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check if the swap entry is cached after acquiring the src_pte
> +                * lock. Or we might miss a newly loaded swap cache folio.
> +                */
> +               entry = pte_to_swp_entry(orig_src_pte);
> +               if (si->swap_map[swp_offset(entry)] & SWAP_HAS_CACHE) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
>         }
>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,10 +1422,20 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                         folio_lock(src_folio);
>                         goto retry;
>                 }
> +               /*
> +                * Check if the folio still belongs to the target swap entry
> +                * after acquiring the lock. Folio can be freed in the swap
> +                * cache while not locked.
> +                */
> +               if (unlikely(!folio_test_swapcache(folio) ||
> +                            entry.val != folio->swap.val)) {
> +                       err = -EAGAIN;
> +                       goto out;
> +               }
>         }
>         err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                             orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -                           dst_ptl, src_ptl, src_folio);
> +                           dst_ptl, src_ptl, src_folio, si);
>         }
>
> out:
> --
> 2.49.0
>

Thanks
Barry