From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 16:41:54 +1200
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Peter Xu, Suren Baghdasaryan,
 Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>

On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 8:17 AM Kairui Song wrote:
> >
> > From: Kairui Song
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup and tries to move the found folio to the faulting vma. Currently,
> > it relies on a PTE value check to ensure the moved folio still belongs
> > to the src swap entry, which turns out not to be reliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing race was observed and reproduced [1]:
> >
> > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and gets installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1>
> >                                    // S1 is freed, folio B can use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry PTE
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and an increased time window, it
> > can be reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > in the swap cache, so it needs to be moved too. In this case we should
> > also retry, so the kernel won't miss a folio move.
> >
> > Fix this by checking that the folio is the valid swap cache folio
> > after acquiring the folio lock, and by checking the swap cache again
> > after acquiring the src_pte lock.
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about it, since folios can only get exposed
> > to the swap cache in the swap-out path, and that case is covered by
> > this patch too, by checking the swap cache again after acquiring the
> > src_pte lock.
> >
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song
> > ---
> >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..a1564d205dfb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -15,6 +15,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >  #include
> >  #include
> >  #include "internal.h"
> > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> >                           struct folio *src_folio)
> >  {
> > +       swp_entry_t entry;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check again after acquiring the src_pte lock. Or we might
> > +                * miss a new loaded swap cache folio.
> > +                */
> > +               entry = pte_to_swp_entry(orig_src_pte);
> > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > +                                             swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       folio_put(src_folio);
> > +                       return -EAGAIN;
> > +               }
> >         }
>
> step 1: src pte points to a swap entry without swapcache
> step 2: we call move_swap_pte()
> step 3: someone swaps in src_pte via swapin_readahead() and makes src_pte's
>         swap entry have a swapcache again - for a non-sync/non-zRAM swap device;
> step 4: move_swap_pte() gets the ptl, moves src_pte to dst_pte and *clears*
>         src_pte;
> step 5: do_swap_page() for src_pte holds the ptl and finds the pte has been
>         cleared in step 4; pte_same() returns false;
> step 6: do_swap_page() won't map src_pte to the new swapcache got in step 3;
>         if the swapcache folio is dropped, it seems everything is fine.
>
> So the real issue is that do_swap_page() doesn't drop the new swapcache
> even when pte_same() returns false? That means the dst_pte swap-in
> can still hit the swap cache entry brought in by the src_pte's swap-in?

It also seems possible for the sync zRAM device.

step 1: src pte points to a swap entry S without swapcache
step 2: we call move_swap_pte()
step 3: someone swaps in src_pte by the sync path, with no swapcache; swap
        slot S is freed. -- for zRAM;
step 4: someone swaps out src_pte, getting exactly the same swap slot S as in
        step 1, and adds a folio to the swapcache due to the swapout;
step 5: move_swap_pte() gets the ptl and finds the page tables are stable,
        since the swap-out happened to get the same swap slot as in step 1;
step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.

Yep, we really need to re-check the pte for swapcache after holding the PTL.
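To make the ordering concrete, here is a condensed sketch of the two
validation points the patch adds, pulled out of their surrounding context
(entry, src_folio and the ptl variables are the names used in the hunks
above; this is only an illustration of the two checks, not the literal
patch code):

        /*
         * (1) The lockless filemap_get_folio() lookup can return a folio
         * that was freed and reused before folio_lock() succeeded, so
         * after locking, confirm the folio still backs the swap entry
         * the move started from.
         */
        if (unlikely(!folio_test_swapcache(src_folio) ||
                     entry.val != src_folio->swap.val))
                return -EAGAIN;         /* stale folio, restart the move */

        /*
         * (2) If no swap cache folio was found before taking the PTL,
         * look once more under the lock: a concurrent swapin/swapout
         * may have installed one for the same entry in the meantime.
         */
        double_pt_lock(dst_ptl, src_ptl);
        if (!src_folio) {
                entry = pte_to_swp_entry(orig_src_pte);
                src_folio = filemap_get_folio(swap_address_space(entry),
                                              swap_cache_index(entry));
                if (!IS_ERR_OR_NULL(src_folio)) {
                        double_pt_unlock(dst_ptl, src_ptl);
                        folio_put(src_folio);
                        return -EAGAIN; /* a folio appeared, retry the move */
                }
        }

Both checks fall back to -EAGAIN, so the move is retried instead of
silently moving (or missing) the wrong folio.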
> >
> >                 orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                 folio_lock(src_folio);
> >                                 goto retry;
> >                         }
> > +                       /*
> > +                        * Check if the folio still belongs to the target swap entry after
> > +                        * acquiring the lock. Folio can be freed in the swap cache while
> > +                        * not locked.
> > +                        */
> > +                       if (unlikely(!folio_test_swapcache(folio) ||
> > +                                    entry.val != folio->swap.val)) {
> > +                               err = -EAGAIN;
> > +                               goto out;
> > +                       }
> >                 }
> >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> >                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > --
> > 2.49.0
> >
>
Thanks
Barry