From: Kairui Song <ryncsn@gmail.com>
Date: Tue, 3 Jun 2025 15:03:10 +0800
Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Peter Xu
Cc: linux-mm@kvack.org, Andrew Morton, Barry Song <21cnbao@gmail.com>,
 Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, Lokesh Gidra,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250602181419.20478-1-ryncsn@gmail.com>
On Tue, Jun 3, 2025 at 4:34 AM Peter Xu wrote:
>
> On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote:
> > From: Kairui Song
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> > lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on the PTE value check to ensure the moved folio
> > still belongs to the src swap entry, which turns out not to be reliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing race was observed and reproduced [1]:
> >
> > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < Somehow interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is just a new allocated folio
> >                                    // and gets installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B into the swap cache with S1>
> >                                    // S1 is freed, folio B could use it
> >                                    // for swap out with no problem.
> >     ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < Somehow interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry PTE
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and increased time window, it can
> > be reproduced [1].
> >
> > It's also possible that folio (A) is swapped in, and swapped out again
> > after the filemap_get_folio lookup; in such a case folio (A) may stay
> > in the swap cache, so it needs to be moved too. In this case we should
> > also try again so the kernel won't miss a folio move.
> >
> > Fix this by checking if the folio is the valid swap cache folio after
> > acquiring the folio lock, and checking the swap cache again after
> > acquiring the src_pte lock.
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about that, since folios might only get
> > exposed to the swap cache in the swap out path, and it's covered in
> > this patch too by checking the swap cache again after acquiring the
> > src_pte lock.
> >
> > [1]
> >
> > Testing with a simple C program to allocate and move several GB of
> > memory did not show any observable performance change.
> >
> > Cc: <stable@vger.kernel.org>
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Signed-off-by: Kairui Song
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> >   the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads an
> >   up-to-date value.
> >
> >  mm/userfaultfd.c | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..5dc05346e360 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                        spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                      struct folio *src_folio)
> > +                      struct folio *src_folio,
> > +                      struct swap_info_struct *si, swp_entry_t entry)
> >  {
> > +     /*
> > +      * Check if the folio still belongs to the target swap entry after
> > +      * acquiring the lock. Folio can be freed in the swap cache while
> > +      * not locked.
> > +      */
> > +     if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > +                               entry.val != src_folio->swap.val))
> > +             return -EAGAIN;
> > +
> >       double_pt_lock(dst_ptl, src_ptl);
> >
> >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >       if (src_folio) {
> >               folio_move_anon_rmap(src_folio, dst_vma);
> >               src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +     } else {
> > +             /*
> > +              * Check if the swap entry is cached after acquiring the src_pte
> > +              * lock. Or we might miss a new loaded swap cache folio.
> > +              */
> > +             if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {

Hi Peter,

Thanks for the review!

> Do we need data_race() for this, if this is an intentionally lockless read?
>
> Another pure swap question: the comment seems to imply this whole thing is
> protected by src_pte lock, but is it?

It's tricky here: in theory we do need to hold the swap cluster lock to
read the swap_map, or it will look kind of hackish (which is the reason
I tried to avoid using that SWAP_HAS_CACHE bit in V1).
But I think it's OK to be lockless here, because the only case where it
could go wrong is this: a concurrent swapin allocated a folio for the
entry and updated the PTE, releasing the swap entry, then another
swapout left the same folio in the swap cache and changed the PTE back,
reusing the entry. The PTE lock is acquired and then released twice
here, so holding the PTE lock will surely see the updated swap_map
value. In other cases, the is_pte_pages_stable check will fail and there
is no race issue.

> I'm not familiar enough with swap code, but it looks to me the folio can be
> added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the
> folio is locked. It doesn't seem to be directly protected by pgtable lock.

You mean "as long as the folio is unlocked", right?

> Perhaps you meant this: since src_pte lock is held, then it'll serialize
> with another thread B concurrently swap-in the swap entry, but only _later_
> when thread B's do_swap_page() will check again on pte_same(), then it'll
> see the src pte gone (after thread A uffdio_move happened releasing src_pte
> lock), hence thread B will release the newly allocated swap cache folio?

The case you described is valid, and there is no bug with that; it's
just not the case I'm fixing here.

> There's another trivial detail that IIUC pte_same() must fail because
> before/after the uffdio_move the swap entry will be occupied so no way to
> have it reused, hence src_pte, even if re-populated again after uffdio_move
> succeeded, cannot become the orig_pte (points to the swap entry in
> question) that thread B read, hence pte_same() must check fail.

A swap entry can be freed before uffdio_move takes the PTE lock: a
swapin can finish and release it. Then another swapout can reuse it.