From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 15:37:14 +1200
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Lokesh Gidra
Cc: Kairui Song, linux-mm@kvack.org, Andrew Morton, Peter Xu,
 Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>

On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra wrote:
>
> On Fri, May 30, 2025 at 1:17 PM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out to be unreliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing race was observed and reproduced [1]:
> >
> > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a newly allocated
> >                                    // folio and gets installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1>
> >                                    // S1 is freed, folio B can use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry pte
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and an increased time window, it
> > can be reproduced [1].
>
> Thanks for catching and fixing this. Just to clarify a few things
> about your reproducer:
> 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or will
> MAP_PRIVATE work as well?
> 2. You mentioned that the 'current dir is on a block device'. Are you
> indicating that if we are using zram for swap then it doesn't
> reproduce?
>
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > in the swap cache, so it needs to be moved too. In this case we should
> > also retry so the kernel won't miss a folio move.
> >
> > Fix this by checking whether the folio is the valid swap cache folio
> > after acquiring the folio lock, and by checking the swap cache again
> > after acquiring the src_pte lock.
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but
> > so far we don't need to worry about it, since folios can only get
> > exposed to the swap cache in the swap out path, and that is covered
> > by this patch too, by checking the swap cache again after acquiring
> > the src_pte lock.
> >
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song
> > ---
> >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a1564d205dfb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
>
> I guess you mistakenly left it from your reproducer code :)
>
> >  #include
> >  #include
> >  #include "internal.h"
> > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> >                          struct folio *src_folio)
> >  {
> > +       swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check again after acquiring the src_pte lock. Or we might
> > +                * miss a newly loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
> > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > +                               swap_cache_index(entry));
>
> Given the non-trivial overhead of filemap_get_folio(), do you think it
> will work if filemap_get_folio() was called only once after locking
> src_ptl? Please correct me if my assumption about the overhead is
> wrong.

Not quite sure, as we have a folio_lock(src_folio) before
move_swap_pte(). Can we safely do folio_move_anon_rmap + the
src_folio->index update while not holding the folio lock?
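
To make the concern concrete, here is a rough sketch of what "look up
the swap cache only once, under the PTL" might look like. This is
hypothetical code, not the posted patch: the function name is made up,
and only the helpers that already exist in mm/userfaultfd.c and the
swap code are real.

/*
 * Hypothetical sketch, not the posted patch: a move_swap_pte() that
 * does the swap cache lookup only once, after taking both PTLs.
 */
static int move_swap_pte_sketch(struct mm_struct *mm,
		unsigned long dst_addr, unsigned long src_addr,
		pte_t *dst_pte, pte_t *src_pte,
		pte_t orig_dst_pte, pte_t orig_src_pte,
		pmd_t *dst_pmd, pmd_t dst_pmdval,
		spinlock_t *dst_ptl, spinlock_t *src_ptl)
{
	struct folio *src_folio;
	swp_entry_t entry;

	double_pt_lock(dst_ptl, src_ptl);
	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte,
				 orig_src_pte, dst_pmd, dst_pmdval)) {
		double_pt_unlock(dst_ptl, src_ptl);
		return -EAGAIN;
	}

	/* Both PTLs held: src_pte can no longer change under us. */
	entry = pte_to_swp_entry(orig_src_pte);
	src_folio = filemap_get_folio(swap_address_space(entry),
				      swap_cache_index(entry));
	if (!IS_ERR_OR_NULL(src_folio)) {
		/*
		 * A swap cache folio has to be locked before
		 * folio_move_anon_rmap() and the ->index update, but
		 * folio_lock() may sleep while the PTLs are spinlocks.
		 * So even here the only option is to drop everything
		 * and let the caller retry with the folio locked
		 * first -- the ordering the posted patch keeps.
		 */
		double_pt_unlock(dst_ptl, src_ptl);
		folio_put(src_folio);
		return -EAGAIN;
	}

	/* No swap cache folio: just move the bare swap entry. */
	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
	set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
	double_pt_unlock(dst_ptl, src_ptl);
	return 0;
}

That is, even with a single lookup under the PTL, any swap cache hit
still has to bail out with -EAGAIN, since the folio lock can't be
taken under the spinlocks.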

> > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       folio_put(src_folio);
> > +                       return -EAGAIN;
> > +               }
> >         }
> >
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target
> > +                        * swap entry after acquiring the lock. Folio can
> > +                        * be freed in the swap cache while not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
>
> To avoid further increasing move_pages_pte() size, I recommend moving
> the entire 'pte not present' case into move_swap_pte(), and maybe
> returning some positive integer (or something more appropriate) to
> handle the retry case. And then in move_swap_pte(), as suggested
> above, you can do filemap_get_folio only once after locking the ptl.
>
> I think this will fix the bug as well as improve the code's
> organization.
>
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > --
> > 2.49.0
> >

Thanks
Barry