From: Lokesh Gidra <lokeshgidra@google.com>
Date: Fri, 30 May 2025 23:10:02 -0700
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Barry Song <21cnbao@gmail.com>
Cc: Kairui Song, linux-mm@kvack.org, Andrew Morton, Peter Xu,
 Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>
On Fri, May 30, 2025 at 9:42 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 31, 2025 at 8:17 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song
> > >
> > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > > lookup, and tries to move the found folio to the faulting vma.
> > > Currently, it relies on the PTE value check to ensure the moved folio
> > > still belongs to the src swap entry, which turns out not to be reliable.
> > >
> > > While working on and reviewing the swap table series with Barry, the
> > > following existing race was observed and reproduced [1]:
> > >
> > > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > >
> > > CPU1                               CPU2
> > > userfaultfd_move
> > >   move_pages_pte()
> > >     entry = pte_to_swp_entry(orig_src_pte);
> > >     // Here it got entry = S1
> > >     ... < Somehow interrupted> ...
> > >                                    <swapin src_pte, alloc and use folio A>
> > >                                    // folio A is a newly allocated folio
> > >                                    // and gets installed into src_pte
> > >                                    <frees swap entry S1>
> > >                                    // src_pte now points to folio A, S1
> > >                                    // has swap count == 0, it can be freed
> > >                                    // by folio_free_swap or the swap
> > >                                    // allocator's reclaim.
> > >                                    <try to swap out another folio B>
> > >                                    // folio B is a folio in another VMA.
> > >                                    <put folio B into swap cache using S1>
> > >                                    // S1 is freed, folio B could use it
> > >                                    // for swap out with no problem.
> > >     ...
> > >     folio = filemap_get_folio(S1)
> > >     // Got folio B here !!!
> > >     ... < Somehow interrupted again> ...
> > >                                    <swapin folio B and free S1>
> > >                                    // Now S1 is free to be used again.
> > >                                    <swapout src_pte & folio A using S1>
> > >                                    // Now src_pte is a swap entry PTE
> > >                                    // holding S1 again.
> > >     folio_trylock(folio)
> > >     move_swap_pte
> > >       double_pt_lock
> > >       is_pte_pages_stable
> > >       // Check passed because src_pte == S1
> > >       folio_move_anon_rmap(...)
> > >       // Moved invalid folio B here !!!
> > >
> > > The race window is very short and requires multiple collisions of
> > > multiple rare events, so it's very unlikely to happen, but with a
> > > deliberately constructed reproducer and an increased time window, it
> > > can be reproduced [1].
> > >
> > > It's also possible that folio (A) is swapped in, and swapped out again
> > > after the filemap_get_folio lookup, in which case folio (A) may stay
> > > in the swap cache so it needs to be moved too.
> > > In this case we should also try again so the kernel won't miss a
> > > folio move.
> > >
> > > Fix this by checking that the folio is the valid swap cache folio after
> > > acquiring the folio lock, and by checking the swap cache again after
> > > acquiring the src_pte lock.
> > >
> > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > > far we don't need to worry about it, since folios can only get exposed
> > > to the swap cache in the swap-out path, and that is covered by this
> > > patch too, by checking the swap cache again after acquiring the src_pte
> > > lock.
> > >
> > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > Signed-off-by: Kairui Song
> > > ---
> > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..a1564d205dfb 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -15,6 +15,7 @@
> > >  #include <...>
> > >  #include <...>
> > >  #include <...>
> > > +#include <...>
> > >  #include <...>
> > >  #include <...>
> > >  #include "internal.h"
> > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > >                          struct folio *src_folio)
> > >  {
> > > +       swp_entry_t entry;
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > >         if (src_folio) {
> > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > +       } else {
> > > +               /*
> > > +                * Check again after acquiring the src_pte lock. Or we might
> > > +                * miss a newly loaded swap cache folio.
> > > +                */
> > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > +                                             swap_cache_index(entry));
> > > +               if (!IS_ERR_OR_NULL(src_folio)) {
> > > +                       double_pt_unlock(dst_ptl, src_ptl);
> > > +                       folio_put(src_folio);
> > > +                       return -EAGAIN;
> > > +               }
> > >         }
> >
> > step 1: src pte points to a swap entry without swapcache
> > step 2: we call move_swap_pte()
> > step 3: someone swaps in src_pte via swapin_readahead() and makes src_pte's
> >         swap entry have swapcache again - for a non-sync/non-zRAM swap device;
> > step 4: move_swap_pte() takes the ptl, moves src_pte to dst_pte and *clears*
> >         src_pte;
> > step 5: do_swap_page() for src_pte holds the ptl and finds the pte has been
> >         cleared in step 4; pte_same() returns false;
> > step 6: do_swap_page() won't map src_pte to the new swapcache got from step 3;
> >         if the swapcache folio is dropped, it seems everything is fine.
> >
> > So the real issue is that do_swap_page() doesn't drop the new swapcache
> > even when pte_same() returns false? That means the dst_pte swap-in
> > can still hit the swap cache entry brought in by the src_pte's swap-in?
>
> It seems also possible for the sync zRAM device.
>
> step 1: src pte points to a swap entry S without swapcache
> step 2: we call move_swap_pte()
> step 3: someone swaps in src_pte by the sync path, no swapcache; swap slot
>         S is freed. -- for zRAM;
> step 4: someone swaps out src_pte and gets exactly the same swap slot S as
>         step 1; the folio is added to swapcache due to the swapout;
> step 5: move_swap_pte() gets the ptl and finds the page tables are stable,
>         since the swap-out happens to have the same swap slot as step 1;
> step 6: we clear src_pte and move src_pte to dst_pte, but miss moving the folio.
>
> Yep, we really need to re-check the pte for swapcache after holding the PTL.

Any idea what the overhead of filemap_get_folio() is? In particular, when
no folio exists for the given entry, how costly is it?
Given how rare it is, unless filemap_get_folio() is cheap in the 'no folio'
case, if there is no way to avoid calling it after holding the PTL, then we
should do it only once at that point. If a folio is returned, then, like in
the pte_present() case, we attempt folio_trylock() with the PTL held;
otherwise we do the retry dance.

> > >
> > >                 orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > @@ -1409,6 +1425,16 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > >                         folio_lock(src_folio);
> > >                         goto retry;
> > >                 }
> > > +               /*
> > > +                * Check if the folio still belongs to the target swap entry after
> > > +                * acquiring the lock. Folio can be freed in the swap cache while
> > > +                * not locked.
> > > +                */
> > > +               if (unlikely(!folio_test_swapcache(folio) ||
> > > +                            entry.val != folio->swap.val)) {
> > > +                       err = -EAGAIN;
> > > +                       goto out;
> > > +               }
> > >         }
> > >         err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > >                             orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > --
> > > 2.49.0
> > >
>
> Thanks
> Barry