From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CFF7AC5AD49 for ; Mon, 2 Jun 2025 20:34:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6B57B6B0334; Mon, 2 Jun 2025 16:34:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 68D396B0336; Mon, 2 Jun 2025 16:34:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 57BCB6B0337; Mon, 2 Jun 2025 16:34:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 37F8F6B0334 for ; Mon, 2 Jun 2025 16:34:26 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id A0B2516132D for ; Mon, 2 Jun 2025 20:34:25 +0000 (UTC) X-FDA: 83511613290.05.1BAD51A Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf15.hostedemail.com (Postfix) with ESMTP id 62FBAA000A for ; Mon, 2 Jun 2025 20:34:23 +0000 (UTC) Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=VFaITMWf; spf=pass (imf15.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748896463; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=HG1FlZp/JYWfFVMaM74f8RK6Xap63I2K8CkMFS8DVkU=; b=sW6+VmWFqTba+0FPhVXqgx+OxKRzac0xzNcLDSwVz74hi5U8ts+Yn87RzEHq9KQj45kLUi Hisqub/8ALCDC6MH2O8hJYnjz/4EzGnF6eJvtxhhJ+vxXKx92rjBkCPUK9n/lDjvM+OxVK BXtPtcQkP1vBVU3zRsluI3oC6yhs5pw= ARC-Authentication-Results: i=1; imf15.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=VFaITMWf; spf=pass (imf15.hostedemail.com: domain of peterx@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748896463; a=rsa-sha256; cv=none; b=GivLmyei46RmzIqfOuXAXcDsW2mom4tbDGDDYi6q7XwXTEhABOXTuzOYji9f/jCmEAt/QK /2+VlYBtGK/0W0UqYcyaG3mlTAalbtNGZOqxHtZUpzpwTHvR9y074ZO43IR0FrWNEE5DqT XJUQQx+gTI2cVVLDkjZbIxk3P6lnxCU= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1748896462; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=HG1FlZp/JYWfFVMaM74f8RK6Xap63I2K8CkMFS8DVkU=; b=VFaITMWfOnHLV6F6/C4eZMFwuNhmiW6WpVAUw1NoNVghRGZbQyDtbpFuulO4TWP+eKH7AF eJ6ppaWEsM1lix8i7/dgxnTVtrmzlmn9ToZvaTWumCsQ1XhbbWCzI2BfBEa0F2YFuUXinn V6MrU8aelwbZ4Ix+k+ow6arq/Tvq7to= Received: from mail-qv1-f72.google.com (mail-qv1-f72.google.com [209.85.219.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-640-Aj6p9yOxOXKqOIsdkZiXHA-1; Mon, 02 Jun 2025 16:34:21 -0400 X-MC-Unique: Aj6p9yOxOXKqOIsdkZiXHA-1 X-Mimecast-MFC-AGG-ID: Aj6p9yOxOXKqOIsdkZiXHA_1748896461 Received: by mail-qv1-f72.google.com with SMTP id 6a1803df08f44-6faca0f2677so113882796d6.1 for ; Mon, 02 Jun 2025 13:34:21 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748896461; x=1749501261; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=HG1FlZp/JYWfFVMaM74f8RK6Xap63I2K8CkMFS8DVkU=; b=nT7p80I2+dTkvg2kUtiF9tYZF0A4FO0oO4kruUwW3pAbgp79FCOE8Sl5ZaK0IN0Kjn tHeEdEt4U+x2seNJ8CT1mzxX2wJgqiBulp2HKQ/jEkhQH9eiKMpjr/n5Owtf5aibscR2 hL1PtJCmMoSpVlwQhYNYRu7FkENduyr/gynsSOxbRVLdTD9/9vL4qQOT6XA7leXSYZpe fLOvT7pFSinGTBSqEF4hiFs76ZoHay88+lT7D3yX1M1sWdwFITKdFUYPnei3Ex6V6jsO B/WjBZrGvwReQNclq8SmnISlrlGos5NB6gvrPUuzOSVZ+ix6Q7VIjBVsyaPBCnlOqnlo aOBg== X-Gm-Message-State: AOJu0YxV3h2ikBCg0TzTH9Sskdh+fOGXCEUA//gci24wUkBXbivzwwZZ A/8Vth8/PASZSn+uauQ5ZilsDwH43QfjGCMu8ePB79Z8uiLCMRXVII8fAEk/gGJ+1ny3QL+b1fv Hebq6uXd9xNCBhgnKYihcmXzL0kyOHGSAZrsn62ERlgd62MgS8Osb X-Gm-Gg: ASbGncsLG9c1TZi5W8K9mwsadB4ZBi4/zf4Oi4X7RAyUzBn4vJMMHxLnZMsLULRg3mT AO6BAuLZ9yMoXsgpoG3dbk+O9s5MfoHn4FGuT7P2jhyK0SI9ziLYX5pokwuN1aRYfKVgx9SDx9R sdAuIZwKpAWmYgSUKxULwmtryG1lWWI7cYTPfvyj2ew/qcDNZdp61OlgdjYDj/yll71qicOcXJ1 LKQqCOCq4RdeRwiWn4UI7OWRevOlkmCJ//9cP0YxJqqZfJEI0n8qwSAc8cISo/87p01ZQu4JTly ToY= X-Received: by 2002:ad4:5be8:0:b0:6fa:cb97:9722 with SMTP id 6a1803df08f44-6fad1a7094cmr243845476d6.34.1748896461216; Mon, 02 Jun 2025 13:34:21 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEyWuOtHzhzC4FqXUSZyEiBDumezo4F5LexvCZMNuuYmdPwR2gUsLwzi8/flISRy75TRlb+9w== X-Received: by 2002:ad4:5be8:0:b0:6fa:cb97:9722 with SMTP id 6a1803df08f44-6fad1a7094cmr243844886d6.34.1748896460706; Mon, 02 Jun 2025 13:34:20 -0700 (PDT) Received: from x1.local ([85.131.185.92]) by smtp.gmail.com with ESMTPSA id 6a1803df08f44-6fac6e1a6dbsm66570076d6.95.2025.06.02.13.34.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 02 Jun 2025 13:34:20 -0700 (PDT) Date: Mon, 2 Jun 2025 16:34:16 -0400 From: Peter Xu To: Kairui Song Cc: linux-mm@kvack.org, Andrew Morton , Barry Song <21cnbao@gmail.com>, Suren Baghdasaryan , Andrea Arcangeli , David Hildenbrand , Lokesh Gidra , stable@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Message-ID: References: <20250602181419.20478-1-ryncsn@gmail.com> MIME-Version: 1.0 In-Reply-To: <20250602181419.20478-1-ryncsn@gmail.com> X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: K1d-8Wxp_-HVrSzikyP95noNgpgzKBTqedwdRbVgi5Q_1748896461 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 62FBAA000A X-Stat-Signature: ycjczqr3g91kcwi5i3izzt3rnwc7ec7c X-Rspam-User: X-HE-Tag: 1748896463-745907 X-HE-Meta: U2FsdGVkX19QUmTZIfcC+jPvOt09exYFmS+WoTz3BHietQe4wLKInwD6PLpuc+oztA4z+oRmX6DVXXf2VqKrvQ2LDhGrcqb87Itm2cgOyvHEcjFDZRnOeAwjvh6qiaNeaocjBJ43ujghu76kKhgahb1T8RXN3OOllcMo/2LAdx4PLqQ3A1M4XnGkjTiVhY70pod6saI7M7Hb3QcpruXbhQRYpZUD006SU/Le05dKMl3/jv6lHqUOZQIDngzrSF0B9O04XN5Kn/FHGrHBstgWfmaA85NqulGwVsFnhtuQjniArFIi5lotJKx9PItDNXXbgGTwyhf8avXlD1XuVF0aiezXL74ux/QYe6fTsJfNGJ4XtPtMKgJLspRbog37OwmGfwHccB2L/ZBZe6eRsokE2TesS+1/srN2nhc7z267c8vNZ+FQyPH+EqiAbvoJmSUrThjmeBBC8DIildRyUu52Eiehiw0upOODKvK2HikhmdmQ2VlVqbnqRP2fLZyJEJDTW7E+mmONqiRvnrSBlk7QMXp0XOzWTH/6iZZcSFJm1b2S004U1IBgK98ulY518aOR4bqpNXxQpsvkYG+aQNujk8LeTp5au0SXp7QbTt2TKLc44qwhGPV/i2MJ9Nz1Qjo3tNUSqtJxVmxeKkY5s2xc3B4t6ksmCriYMDnIDASxc7gJhTj1nJExwwYzfKPaCyNj3QQ2A23YpEs0J943VZb/Pqbpq9TcJAANaEsbyEcZYyGYBq9Tn9RbyRSPfeQvpvuuGIXlL+QG4lYCOUFZkc9uRXN6KXrc7sCcfp/sMO56Y+Sdo5hGlixV4QguLJ9zUClhpPWUK8UgtlpUsY+DA0dkedcnedkuKNgYA5FASMJHyv4E/E3TBYUfBO+dmx9CTWOW+wNrbFTR9GDz54Nstr+eNXsL4SHP7ItZX7hHQc/OqTcohzZE1Y344LbjKNNjXsw/2zjKI9yvMiylTnpmVN1 +Ppi6tc0 /CCL2cHnNliZz01UrOPJFJng/4/DmQbGDdQ7X24RXRu6J1MaA9Gk3rbMXjVMbSQ+1YsUpZUz5XTBzJ/Qr8k1bwFGWosNX2IgTu70Fw8e2M3ak/MrTKCV+XZutpcSAkL9dmfSuMJz+tMqwvHKiNqm1195Vy7fs7nqXSDX2Xsp1nvNcUygrNQwtMtznLLNWeKpPfEt+QvND+HEKf4D/2gYsuguebtV7HvNKIc2JIPXvu7c5kbi2nQ31S3DTtgMNALUBsfnAv8HdfM2lQd5WfFfjrr+K2kSV3tTslnxP5Nn+1WqZlqRfI4zFD3HsZiHyExeEZYd9wTkuhRDlKSSdQV1YbUdniOPmSbaiPfXfBekFiyzCVCvNPLEzZcYoZV4CjRVrHy8ULea8H2mzSTV4YyTSPznmdtBq/QLLzkIK+jdnTn/dnJpBTznYoUHk7N4+I4W80LkYMdE1M7qI/sw= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote: > From: Kairui Song > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache > lookup, and try to move the found folio to the faulting vma when. > Currently, it relies on the PTE value check to ensure the moved folio > still belongs to the src swap entry, which turns out is not reliable. > > While working and reviewing the swap table series with Barry, following > existing race is observed and reproduced [1]: > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.) > > CPU1 CPU2 > userfaultfd_move > move_pages_pte() > entry = pte_to_swp_entry(orig_src_pte); > // Here it got entry = S1 > ... < Somehow interrupted> ... > > // folio A is just a new allocated folio > // and get installed into src_pte > > // src_pte now points to folio A, S1 > // has swap count == 0, it can be freed > // by folio_swap_swap or swap > // allocator's reclaim. > > // folio B is a folio in another VMA. > > // S1 is freed, folio B could use it > // for swap out with no problem. > ... > folio = filemap_get_folio(S1) > // Got folio B here !!! > ... < Somehow interrupted again> ... > > // Now S1 is free to be used again. > > // Now src_pte is a swap entry pte > // holding S1 again. > folio_trylock(folio) > move_swap_pte > double_pt_lock > is_pte_pages_stable > // Check passed because src_pte == S1 > folio_move_anon_rmap(...) > // Moved invalid folio B here !!! > > The race window is very short and requires multiple collisions of > multiple rare events, so it's very unlikely to happen, but with a > deliberately constructed reproducer and increased time window, it can be > reproduced [1]. > > It's also possible that folio (A) is swapped in, and swapped out again > after the filemap_get_folio lookup, in such case folio (A) may stay in > swap cache so it needs to be moved too. In this case we should also try > again so kernel won't miss a folio move. > > Fix this by checking if the folio is the valid swap cache folio after > acquiring the folio lock, and checking the swap cache again after > acquiring the src_pte lock. > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far > we don't need to worry about that since folios only might get exposed to > swap cache in the swap out path, and it's covered in this patch too by > checking the swap cache again after acquiring src_pte lock. [1] > > Testing with a simple C program to allocate and move several GB of memory > did not show any observable performance change. > > Cc: > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1] > Signed-off-by: Kairui Song > > --- > > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/ > Changes: > - Check swap_map instead of doing a filemap lookup after acquiring the > PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ] > > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/ > Changes: > - Move the folio and swap check inside move_swap_pte to avoid skipping > the check and potential overhead [ Lokesh Gidra ] > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated > value. > > mm/userfaultfd.c | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index bc473ad21202..5dc05346e360 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, > pte_t orig_dst_pte, pte_t orig_src_pte, > pmd_t *dst_pmd, pmd_t dst_pmdval, > spinlock_t *dst_ptl, spinlock_t *src_ptl, > - struct folio *src_folio) > + struct folio *src_folio, > + struct swap_info_struct *si, swp_entry_t entry) > { > + /* > + * Check if the folio still belongs to the target swap entry after > + * acquiring the lock. Folio can be freed in the swap cache while > + * not locked. > + */ > + if (src_folio && unlikely(!folio_test_swapcache(src_folio) || > + entry.val != src_folio->swap.val)) > + return -EAGAIN; > + > double_pt_lock(dst_ptl, src_ptl); > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, > if (src_folio) { > folio_move_anon_rmap(src_folio, dst_vma); > src_folio->index = linear_page_index(dst_vma, dst_addr); > + } else { > + /* > + * Check if the swap entry is cached after acquiring the src_pte > + * lock. Or we might miss a new loaded swap cache folio. > + */ > + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { Do we need data_race() for this, if this is an intentionally lockless read? Another pure swap question: the comment seems to imply this whole thing is protected by src_pte lock, but is it? I'm not familiar enough with swap code, but it looks to me the folio can be added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the folio is locked. It doesn't seem to be directly protected by pgtable lock. Perhaps you meant this: since src_pte lock is held, then it'll serialize with another thread B concurrently swap-in the swap entry, but only _later_ when thread B's do_swap_page() will check again on pte_same(), then it'll see the src pte gone (after thread A uffdio_move happened releasing src_pte lock), hence thread B will release the newly allocated swap cache folio? There's another trivial detail that IIUC pte_same() must fail because before/after the uffdio_move the swap entry will be occupied so no way to have it reused, hence src_pte, even if re-populated again after uffdio_move succeeded, cannot become the orig_pte (points to the swap entry in question) that thread B read, hence pte_same() must check fail. I'm not sure my understanding is correct, though. Maybe some richer comment would always help. Thanks, > + double_pt_unlock(dst_ptl, src_ptl); > + return -EAGAIN; > + } > } > > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte); > @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > } > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte, > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval, > - dst_ptl, src_ptl, src_folio); > + dst_ptl, src_ptl, src_folio, si, entry); > } > > out: > -- > 2.49.0 > > -- Peter Xu