From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Barry Song <21cnbao@gmail.com>, Peter Xu,
	Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
	Lokesh Gidra, stable@vger.kernel.org, linux-kernel@vger.kernel.org,
	Kairui Song
Subject: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
Date: Wed, 4 Jun 2025 23:10:38 +0800
Message-ID: <20250604151038.21968-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.49.0
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Kairui Song

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma. Currently,
it relies on checking the PTE value to ensure that the moved folio still
belongs to the src swap entry and that no new folio has been added to the
swap cache, which turns out to be unreliable.

While working on and reviewing the swap table series with Barry, the
following existing races were observed and reproduced [1]:

In the example below, move_pages_pte is moving src_pte to dst_pte, where
src_pte is a swap entry PTE holding swap entry S1, and S1 is not in the
swap cache:

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < interrupted> ...
                                   <swapin src_pte, alloc and use folio A>
                                   // folio A is a newly allocated folio
                                   // and gets installed into src_pte
                                   <frees swap entry S1>
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap() or the swap
                                   // allocator's reclaim.
                                   <try to swap out another folio B>
                                   // folio B is a folio in another VMA.
                                   <put folio B to swap cache using S1>
                                   // S1 is freed, folio B can use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
    ... < interrupted again> ...
                                   <swapin folio B and free S1>
                                   // Now S1 is free to be used again.
                                   <swapout src_pte & folio A using S1>
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple collisions of
multiple rare events, so it's very unlikely to happen, but with a
deliberately constructed reproducer and an increased time window, it can
be reproduced easily.

This can be fixed by checking if the folio returned by filemap is the
valid swap cache folio after acquiring the folio lock.

Another similar race is possible: filemap_get_folio may return NULL, but
folio (A) could be swapped in and then swapped out again using the same
swap entry after the lookup. In such a case, folio (A) may remain in the
swap cache, so it must be moved too:

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1, and S1 is not in swap cache
    folio = filemap_get_folio(S1)
    // Got NULL
    ... < interrupted again> ...
                                   <swapin folio A and free S1>
                                   <swapout folio A using S1>
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // folio A is ignored !!!

Fix this by checking the swap cache again after acquiring the src_pte
lock. And to avoid the filemap overhead, we check swap_map directly [2].
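For reference, here is a minimal userspace sketch of driving UFFDIO_MOVE,
the operation racing above. This is an illustration only, not the
reproducer from [1]: it is the usual userfaultfd boilerplate with error
handling trimmed, and it needs Linux 6.8+ headers for UFFDIO_MOVE:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	/* UFFD_USER_MODE_ONLY keeps this runnable without privileges */
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
	struct uffdio_api api = { .api = UFFD_API };

	ioctl(uffd, UFFDIO_API, &api);

	char *src = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(src, 0xaa, psz);	/* make the src page present */

	/* the dst range must be registered with the uffd */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = psz },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* kernel-side this reaches move_pages_pte(); when src_pte is a
	 * swap entry PTE it takes the move_swap_pte() path fixed here */
	struct uffdio_move mv = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = psz,
		.mode = 0,
	};
	ioctl(uffd, UFFDIO_MOVE, &mv);

	return mv.move == psz ? 0 : 1;	/* .move returns bytes moved */
}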
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about it: folios can only be exposed to the
swap cache in the swap-out path, and that case is covered by this patch,
which checks the swap cache again after acquiring the src_pte lock.

Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.

Cc: stable@vger.kernel.org
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
Signed-off-by: Kairui Song
Reviewed-by: Lokesh Gidra
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
  PTE lock to minimize critical section overhead [ Barry Song, Lokesh
  Gidra ]

V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
Changes:
- Move the folio and swap check inside move_swap_pte to avoid skipping
  the check and potential overhead [ Lokesh Gidra ]
- Add a READ_ONCE for the swap_map read to ensure it reads an up-to-date
  value.

V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
Changes:
- Add more comments and more context in the commit message.

 mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)
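Not part of the patch, for reviewers: the "simple C program" mentioned in
the commit message was roughly of the following shape. This is a
hypothetical sketch of such a throughput test, not the actual program
used; setup mirrors the earlier snippet and error handling is trimmed:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define TOTAL (1UL << 30)	/* 1 GiB per run; scale up as needed */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
	struct uffdio_api api = { .api = UFFD_API };
	struct timespec t0, t1;
	unsigned long off;

	ioctl(uffd, UFFDIO_API, &api);

	char *src = mmap(NULL, TOTAL, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *dst = mmap(NULL, TOTAL, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(src, 1, TOTAL);	/* populate all src pages */

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = TOTAL },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* time page-by-page moves to stress the per-PTE path */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (off = 0; off < TOTAL; off += psz) {
		struct uffdio_move mv = {
			.dst = (unsigned long)(dst + off),
			.src = (unsigned long)(src + off),
			.len = psz,
		};
		if (ioctl(uffd, UFFDIO_MOVE, &mv))
			perror("UFFDIO_MOVE");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("moved %lu MiB in %.3f s\n", TOTAL >> 20,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}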
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..8253978ee0fb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
-			 struct folio *src_folio)
+			 struct folio *src_folio,
+			 struct swap_info_struct *si, swp_entry_t entry)
 {
+	/*
+	 * Check if the folio still belongs to the target swap entry after
+	 * acquiring the lock. Folio can be freed in the swap cache while
+	 * not locked.
+	 */
+	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
+				  entry.val != src_folio->swap.val))
+		return -EAGAIN;
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check if the swap entry is cached after acquiring the src_pte
+		 * lock. Otherwise, we might miss a newly loaded swap cache folio.
+		 *
+		 * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
+		 * We are trying to catch newly added swap cache, the only possible case is
+		 * when a folio is swapped in and out again staying in swap cache, using the
+		 * same entry before the PTE check above. The PTL is acquired and released
+		 * twice, each time after updating the swap_map's flag. So holding
+		 * the PTL here ensures we see the updated value. False positive is possible,
+		 * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
+		 * cache, or during the tiny synchronization window between swap cache and
+		 * swap_map, but it will be gone very quickly, worst result is retry jitters.
+		 */
+		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1412,7 +1441,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		}
 		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-				dst_ptl, src_ptl, src_folio);
+				dst_ptl, src_ptl, src_folio, si, entry);
 	}
 
 out:
-- 
2.49.0