From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 18:54:17 +1200
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: Lokesh Gidra, linux-mm@kvack.org, Andrew Morton, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"
On Sat, May 31, 2025 at 6:36 PM Kairui Song wrote:
>
> On Sat, May 31, 2025 at 2:10 PM Lokesh Gidra wrote:
> >
> > On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 31, 2025 at
8:17 AM Kairui Song wrote:
> > > > >
> > > > > From: Kairui Song
> > > > >
> > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > > > lookup, and tries to move the found folio to the faulting vma.
> > > > > Currently, it relies on the PTE value check to ensure the moved folio
> > > > > still belongs to the src swap entry, which turns out to be unreliable.
> > > > >
> > > > > While working on and reviewing the swap table series with Barry, the
> > > > > following existing race was observed and reproduced [1]:
> > > > >
> > > > > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > >
> > > > > CPU1                               CPU2
> > > > > userfaultfd_move
> > > > >   move_pages_pte()
> > > > >     entry = pte_to_swp_entry(orig_src_pte);
> > > > >     // Here it got entry = S1
> > > > >     ... < Somehow interrupted> ...
> > > > >                                    <swap in src_pte, using folio A>
> > > > >                                    // folio A is a newly allocated folio
> > > > >                                    // and gets installed into src_pte
> > > > >                                    <free swap entry S1>
> > > > >                                    // src_pte now points to folio A, S1
> > > > >                                    // has swap count == 0, it can be freed
> > > > >                                    // by folio_free_swap or the swap
> > > > >                                    // allocator's reclaim.
> > > > >                                    <swap out another folio, reusing S1>
> > > > >                                    // folio B is a folio in another VMA.
> > > > >                                    // S1 is freed, folio B could use it
> > > > >                                    // for swap out with no problem.
> > > > >                                    ...
> > > > >     folio = filemap_get_folio(S1)
> > > > >     // Got folio B here !!!
> > > > >     ... < Somehow interrupted again> ...
> > > > >                                    <swap in folio B, freeing S1>
> > > > >                                    // Now S1 is free to be used again.
> > > > >                                    <swap out src_pte, reusing S1>
> > > > >                                    // Now src_pte is a swap entry pte
> > > > >                                    // holding S1 again.
> > > > >     folio_trylock(folio)
> > > > >     move_swap_pte
> > > > >       double_pt_lock
> > > > >       is_pte_pages_stable
> > > > >       // Check passed because src_pte == S1
> > > > >       folio_move_anon_rmap(...)
> > > > >       // Moved invalid folio B here !!!
> > > > >
> > > > > The race window is very short and requires multiple collisions of
> > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > deliberately constructed reproducer and an increased time window, it
> > > > > can be reproduced [1].
> > > > >
> > > > > It's also possible that folio (A) is swapped in, and swapped out again
> > > > > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > > > > in the swap cache, so it needs to be moved too. In this case we should
> > > > > also try again so the kernel won't miss a folio move.
> > > > >
> > > > > Fix this by checking that the folio is the valid swap cache folio
> > > > > after acquiring the folio lock, and by checking the swap cache again
> > > > > after acquiring the src_pte lock.
> > > > >
> > > > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but
> > > > > so far we don't need to worry about that, since folios might only get
> > > > > exposed to the swap cache in the swap out path, and that's covered in
> > > > > this patch too by checking the swap cache again after acquiring the
> > > > > src_pte lock.
> > > > >
> > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > Signed-off-by: Kairui Song
> > > > > ---
> > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > >  1 file changed, 26 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -15,6 +15,7 @@
> > > > >  #include <linux/mmu_notifier.h>
> > > > >  #include <linux/hugetlb.h>
> > > > >  #include <linux/shmem_fs.h>
> > > > > +#include "swap.h"
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > >                           struct folio *src_folio)
> > > > >  {
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > >         if (src_folio) {
> > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > +       } else {
> > > > > +               /*
> > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > +                * miss a new loaded swap cache folio.
> > > > > + */ > > > > > + entry =3D pte_to_swp_entry(orig_src_pte); > > > > > + src_folio =3D filemap_get_folio(swap_address_spac= e(entry), > > > > > + swap_cache_index(en= try)); > > > > > + if (!IS_ERR_OR_NULL(src_folio)) { > > > > > + double_pt_unlock(dst_ptl, src_ptl); > > > > > + folio_put(src_folio); > > > > > + return -EAGAIN; > > > > > + } > > > > > } > > > > > > > > step 1: src pte points to a swap entry without swapcache > > > > step 2: we call move_swap_pte() > > > > step 3: someone swap-in src_pte by swap_readhead() and make src_pte= 's swap entry > > > > have swapcache again - for non-sync/non-zRAM swap device; > > > > step 4: move_swap_pte() gets ptl, move src_pte to dst_pte and *clea= r* src_pte; > > > > step 5: do_swap_page() for src_pte holds the ptl and found pte has > > > > been cleared in > > > > step 4; pte_same() returns false; > > > > step 6: do_swap_page() won't map src_pte to the new swapcache got f= rom step 3; > > > > if the swapcache folio is dropped, it seems everything = is fine. > > > > > > > > So the real issue is that do_swap_page() doesn=E2=80=99t drop the n= ew swapcache > > > > even when pte_same() returns false? That means the dst_pte swap-in > > > > can still hit the swap cache entry brought in by the src_pte's swap= -in? > > > > > > It seems also possible for the sync zRAM device. > > > > > > step 1: src pte points to a swap entry S without swapcache > > > step 2: we call move_swap_pte() > > > step 3: someone swap-in src_pte by sync path, no swapcache; swap slo= t > > > S is freed. > > > -- for zRAM; > > > step 4: someone swap-out src_pte, get the exactly same swap slot S a= s step 1, > > > adds folio to swapcache due to swapout; > > > step 5: move_swap_pte() gets ptl and finds page tables are stable > > > since swap-out > > > happens to have the same swap slot as step1; > > > step 6: we clear src_pte, move src_pte to dst_pte; but miss to move = the folio. 
> > >
> > > Yep, we really need to re-check the pte for swapcache after holding the PTL.
> >
> > Any idea what is the overhead of filemap_get_folio()? In particular,
> > when no folio exists for the given entry, how costly is it?
> >
> > Given how rare it is, unless filemap_get_folio() is cheap for the 'no
> > folio' case, if there is no way to avoid calling it after holding the
> > PTL, then we should do it only once at that point. If a folio is
> > returned, then, like in the pte_present() case, we attempt
> > folio_trylock() with the PTL held, otherwise do the retry dance.
>
> Yeah, I think filemap_get_folio is cheap; each swap cache space is at
> most 64M big, so it just walks at most three xa_nodes and returns, not
> involving any synchronization or write.
>
> The swap cache lookup will be even cheaper in the future, becoming just
> a check of one plain array element.
>
> I can try to fix this with the folio_trylock-inside-the-PTL approach;
> maybe the code will be cleaner that way.

Using a trylock in an atomic context and deferring to a full lock in a
non-atomic context might be considered a workaround - it feels a bit
hackish to me :-)

To follow that, the reader might need to understand the entire workflow
from kernel to userspace to grasp what's happening with the folio lock,
whereas the current code handles everything within the kernel, making it
more natural.

Thanks
Barry