From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Barry Song <21cnbao@gmail.com>, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable@vger.kernel.org, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2] mm: userfaultfd: fix race of userfaultfd_move and swap cache
Date: Mon, 2 Jun 2025 04:01:08 +0800
Message-ID: <20250601200108.23186-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.49.0
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song <ryncsn@gmail.com>

On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
lookup, and tries to move the found folio to the faulting vma. Currently,
it relies on the PTE value check to ensure that the moved folio still
belongs to the src swap entry, which turns out to be unreliable.

While working on and reviewing the swap table series with Barry, the
following existing race was observed and reproduced [1]:

(move_pages_pte is moving src_pte to dst_pte, where src_pte is a swap
entry PTE holding swap entry S1, and S1 isn't in the swap cache.)

CPU1                               CPU2
userfaultfd_move
  move_pages_pte()
    entry = pte_to_swp_entry(orig_src_pte);
    // Here it got entry = S1
    ... < Somehow interrupted> ...
                                   // folio A is just a newly allocated
                                   // folio that gets installed into
                                   // src_pte.
                                   // src_pte now points to folio A, S1
                                   // has swap count == 0, it can be freed
                                   // by folio_free_swap or the swap
                                   // allocator's reclaim.
                                   // folio B is a folio in another VMA.
                                   // S1 is freed, folio B could use it
                                   // for swap out with no problem.
                                   ...
    folio = filemap_get_folio(S1)
    // Got folio B here !!!
                                   ... < Somehow interrupted again> ...
                                   // Now S1 is free to be used again.
                                   // Now src_pte is a swap entry PTE
                                   // holding S1 again.
    folio_trylock(folio)
    move_swap_pte
      double_pt_lock
      is_pte_pages_stable
      // Check passed because src_pte == S1
      folio_move_anon_rmap(...)
      // Moved invalid folio B here !!!

The race window is very short and requires multiple rare events to
collide, so it's very unlikely to happen, but with a deliberately
constructed reproducer and an increased time window, it can be
reproduced [1].

It's also possible that folio (A) is swapped in, and swapped out again
after the filemap_get_folio lookup, in which case folio (A) may stay in
the swap cache, so it needs to be moved too. In this case we should also
try again so the kernel won't miss a folio move.

Fix this by checking that the folio is the valid swap cache folio after
acquiring the folio lock, and by checking the swap cache again after
acquiring the src_pte lock.
The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
far we don't need to worry about that, since folios might only get
exposed to the swap cache in the swap out path, and that is covered by
this patch too, by checking the swap cache again after acquiring the
src_pte lock.

Testing with a simple C program that allocates and moves several GB of
memory did not show any observable performance change.

Cc: stable@vger.kernel.org
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
Signed-off-by: Kairui Song <ryncsn@gmail.com>
---
V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/

Changes:
- Check swap_map instead of doing a filemap lookup after acquiring the
  PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]

 mm/userfaultfd.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..a74ede04996c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1084,8 +1084,11 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
-			 struct folio *src_folio)
+			 struct folio *src_folio,
+			 struct swap_info_struct *si)
 {
+	swp_entry_t entry;
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1102,6 +1105,16 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		/*
+		 * Check if the swap entry is cached after acquiring the src_pte
+		 * lock. Otherwise, we might miss a newly loaded swap cache folio.
+		 */
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (si->swap_map[swp_offset(entry)] & SWAP_HAS_CACHE) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
 	}
 
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
@@ -1409,10 +1422,20 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			folio_lock(src_folio);
 			goto retry;
 		}
+		/*
+		 * Check if the folio still belongs to the target swap entry after
+		 * acquiring the lock. The folio can be freed from the swap cache
+		 * while not locked.
+		 */
+		if (unlikely(!folio_test_swapcache(folio) ||
+			     entry.val != folio->swap.val)) {
+			err = -EAGAIN;
+			goto out;
+		}
 	}
 	err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 			orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
-			dst_ptl, src_ptl, src_folio);
+			dst_ptl, src_ptl, src_folio, si);
 	}
 
 out:
-- 
2.49.0