From: Lokesh Gidra <lokeshgidra@google.com>
Date: Sun, 1 Jun 2025 18:43:25 -0700
Subject: Re: [PATCH v2] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Barry Song <21cnbao@gmail.com>, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20250601200108.23186-1-ryncsn@gmail.com>
References: <20250601200108.23186-1-ryncsn@gmail.com>
On Sun, Jun 1, 2025 at 1:01 PM Kairui Song wrote:
>
> From: Kairui Song
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on a PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out not to be reliable.
>
> While working on and reviewing the swap table series with Barry, the
> following existing race was observed and reproduced [1]:
>
> ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a
>   swap entry PTE holding swap entry S1, and S1 isn't in the swap cache. )
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is a newly allocated folio
>                                    // and gets installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A; S1
>                                    // has swap count == 0 and can be freed
>                                    // by folio_free_swap or the swap
>                                    // allocator's reclaim.
>                                    <swap out another folio B using S1>
>                                    // folio B is a folio in another VMA.
>                                    // S1 is freed, so folio B can use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swap out src_pte using S1 again>
>                                    // Now src_pte is a swap entry pte
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and an increased time window, it
> can be reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again,
> after the filemap_get_folio lookup; in such a case folio (A) may stay
> in the swap cache, so it needs to be moved too. Here we should also
> retry, so the kernel won't miss a folio move.
>
> Fix this by checking that the folio is still the valid swap cache folio
> after acquiring the folio lock, and by checking the swap cache again
> after acquiring the src_pte lock.
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about it, since folios can only get exposed
> to the swap cache in the swap out path, and that is covered by this
> patch too, by checking the swap cache again after acquiring the src_pte
> lock.
>
> Testing with a simple C program that allocates and moves several GB of
> memory did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song
>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
>  mm/userfaultfd.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..a74ede04996c 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,11 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>                          pte_t orig_dst_pte, pte_t orig_src_pte,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -                        struct folio *src_folio)
> +                        struct folio *src_folio,
> +                        struct swap_info_struct *si)
>  {
> +       swp_entry_t entry;
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1105,16 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>         if (src_folio) {
>                 folio_move_anon_rmap(src_folio, dst_vma);
>                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> +       } else {
> +               /*
> +                * Check if the swap entry is cached after acquiring the src_pte
> +                * lock. Or we might miss a newly loaded swap cache folio.
> +                */
> +               entry = pte_to_swp_entry(orig_src_pte);

Can we pass this in from move_pages_pte() as well? It would be great to
minimize the PTL critical section.
> +               if (si->swap_map[swp_offset(entry)] & SWAP_HAS_CACHE) {
> +                       double_pt_unlock(dst_ptl, src_ptl);
> +                       return -EAGAIN;
> +               }
>         }
>
>         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1409,10 +1422,20 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>                                 folio_lock(src_folio);
>                                 goto retry;
>                         }
> +                       /*
> +                        * Check if the folio still belongs to the target swap
> +                        * entry after acquiring the lock. The folio can be
> +                        * freed from the swap cache while not locked.
> +                        */
> +                       if (unlikely(!folio_test_swapcache(folio) ||
> +                                    entry.val != folio->swap.val)) {
> +                               err = -EAGAIN;
> +                               goto out;
> +                       }

This check will get skipped if the folio was locked by folio_lock()
rather than folio_trylock(). It seems to me that the correct way to do
this is to check outside this if block (right before calling
move_swap_pte()) and with 'src_folio' instead of 'folio'. Even better,
if you pass 'entry' as well to move_swap_pte() as per my previous
comment, then you can move this check into move_swap_pte() too.

>                 }
>                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
>                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -                                   dst_ptl, src_ptl, src_folio);
> +                                   dst_ptl, src_ptl, src_folio, si);
>         }
>
> out:
> --
> 2.49.0
>