From: Lokesh Gidra <lokeshgidra@google.com>
Date: Mon, 2 Jun 2025 13:17:24 -0700
Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Barry Song <21cnbao@gmail.com>, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, stable@vger.kernel.org, linux-kernel@vger.kernel.org, Kalesh Singh
In-Reply-To: <20250602181419.20478-1-ryncsn@gmail.com>
References: <20250602181419.20478-1-ryncsn@gmail.com>
On Mon, Jun 2, 2025 at 11:14 AM Kairui Song wrote:
>
> From: Kairui Song
>
> On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache
> lookup, and tries to move the found folio to the faulting vma.
> Currently, it relies on a PTE value check to ensure the moved folio
> still belongs to the src swap entry, which turns out not to be reliable.
>
> While working on and reviewing the swap table series with Barry, the
> following pre-existing race was observed and reproduced [1]:
>
> (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
>
> CPU1                               CPU2
> userfaultfd_move
>   move_pages_pte()
>     entry = pte_to_swp_entry(orig_src_pte);
>     // Here it got entry = S1
>     ... < Somehow interrupted> ...
>                                    <swapin src_pte, alloc and use folio A>
>                                    // folio A is a newly allocated folio
>                                    // and gets installed into src_pte
>                                    <frees swap entry S1>
>                                    // src_pte now points to folio A, S1
>                                    // has swap count == 0, it can be freed
>                                    // by folio_free_swap or the swap
>                                    // allocator's reclaim.
>                                    <try to swap out another folio B>
>                                    // folio B is a folio in another VMA.
>                                    <put folio B into swap cache using S1>
>                                    // S1 is freed, folio B can use it
>                                    // for swap out with no problem.
>                                    ...
>     folio = filemap_get_folio(S1)
>     // Got folio B here !!!
>     ... < Somehow interrupted again> ...
>                                    <swapin folio B and free S1>
>                                    // Now S1 is free to be used again.
>                                    <swapout src_pte & folio A using S1>
>                                    // Now src_pte is a swap entry PTE
>                                    // holding S1 again.
>     folio_trylock(folio)
>     move_swap_pte
>       double_pt_lock
>       is_pte_pages_stable
>       // Check passed because src_pte == S1
>       folio_move_anon_rmap(...)
>       // Moved invalid folio B here !!!
>
> The race window is very short and requires multiple collisions of
> multiple rare events, so it's very unlikely to happen, but with a
> deliberately constructed reproducer and an increased time window, it
> can be reproduced [1].
>
> It's also possible that folio (A) is swapped in, and swapped out again
> after the filemap_get_folio lookup; in that case folio (A) may stay in
> the swap cache, so it needs to be moved too. Here we should also retry
> so the kernel won't miss a folio move.
>
> Fix this by checking that the folio is the valid swap cache folio after
> acquiring the folio lock, and checking the swap cache again after
> acquiring the src_pte lock.
>
> The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> far we don't need to worry about it, since folios can only get exposed
> to the swap cache in the swap-out path, and that is covered by this
> patch too by checking the swap cache again after acquiring the src_pte
> lock.
>
> Testing with a simple C program that allocates and moves several GB of
> memory did not show any observable performance change.
>
> Cc: <stable@vger.kernel.org>
> Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> Signed-off-by: Kairui Song

Reviewed-by: Lokesh Gidra <lokeshgidra@google.com>

>
> ---
>
> V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> Changes:
> - Check swap_map instead of doing a filemap lookup after acquiring the
>   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
>
> V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> Changes:
> - Move the folio and swap check inside move_swap_pte to avoid skipping
>   the check and potential overhead [ Lokesh Gidra ]
> - Add a READ_ONCE for the swap_map read to ensure it reads an
>   up-to-date value.
>
>  mm/userfaultfd.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..5dc05346e360 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  			 pte_t orig_dst_pte, pte_t orig_src_pte,
>  			 pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> -			 struct folio *src_folio)
> +			 struct folio *src_folio,
> +			 struct swap_info_struct *si, swp_entry_t entry)
>  {
> +	/*
> +	 * Check if the folio still belongs to the target swap entry after
> +	 * acquiring the lock. The folio can be freed in the swap cache while
> +	 * not locked.
> +	 */
> +	if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> +				  entry.val != src_folio->swap.val))
> +		return -EAGAIN;
> +
> 	double_pt_lock(dst_ptl, src_ptl);
>
> 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> 	if (src_folio) {
> 		folio_move_anon_rmap(src_folio, dst_vma);
> 		src_folio->index = linear_page_index(dst_vma, dst_addr);
> +	} else {
> +		/*
> +		 * Check if the swap entry is cached after acquiring the src_pte
> +		 * lock. Otherwise we might miss a newly loaded swap cache folio.
> +		 */
> +		if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
> +			double_pt_unlock(dst_ptl, src_ptl);
> +			return -EAGAIN;
> +		}
> 	}
>
> 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> @@ -1412,7 +1431,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> 	}
> 	err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> 			    orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> -			    dst_ptl, src_ptl, src_folio);
> +			    dst_ptl, src_ptl, src_folio, si, entry);
> 	}
>
> out:
> --
> 2.49.0
>