From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 11 Jun 2025 13:16:25 +0800
Subject: Re: [PATCH v4] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Chris Li, Andrew Morton
Cc: linux-mm@kvack.org, Barry Song <21cnbao@gmail.com>, Peter Xu, Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand, Lokesh Gidra, stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250604151038.21968-1-ryncsn@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, Jun 6, 2025 at 8:15 AM Chris Li wrote:
>
> On Wed, Jun 4, 2025 at 8:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <ryncsn@gmail.com>
> >
> > On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> > cache lookup, and tries to move the found folio to the faulting vma.
> > Currently, it relies on checking the PTE value to ensure that the moved
> > folio still belongs to the src swap entry and that no new folio has
> > been added to the swap cache, which turns out to be unreliable.
> >
> > While working on and reviewing the swap table series with Barry, the
> > following existing races were observed and reproduced [1]:
> >
> > In the example below, move_pages_pte is moving src_pte to dst_pte,
> > where src_pte is a swap entry PTE holding swap entry S1, and S1
> > is not in the swap cache:
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1
> >     ... < interrupted> ...
> >                                    <swapin src_pte, alloc and use folio A>
> >                                    // folio A is a newly allocated folio
> >                                    // and gets installed into src_pte
> >                                    <frees swap entry S1>
> >                                    // src_pte now points to folio A, S1
> >                                    // has swap count == 0, it can be freed
> >                                    // by folio_free_swap or the swap
> >                                    // allocator's reclaim.
> >                                    <try to swap out another folio B>
> >                                    // folio B is a folio in another VMA.
> >                                    <put folio B to swap cache using S1>
> >                                    // S1 is freed, folio B can use it
> >                                    // for swap out with no problem.
> >                                    ...
> >     folio = filemap_get_folio(S1)
> >     // Got folio B here !!!
> >     ... < interrupted again> ...
> >                                    <swapin folio B and free S1>
> >                                    // Now S1 is free to be used again.
> >                                    <swapout src_pte & folio A using S1>
> >                                    // Now src_pte is a swap entry PTE
> >                                    // holding S1 again.
> >     folio_trylock(folio)
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // Moved invalid folio B here !!!
> >
> > The race window is very short and requires multiple collisions of
> > multiple rare events, so it's very unlikely to happen, but with a
> > deliberately constructed reproducer and an increased time window, it
> > can be reproduced easily.
>
> Thanks for the fix.
>
> Please spell out clearly what the consequence of the race is if
> triggered. I assume possible data loss? That should be mentioned in
> the first few sentences of the commit message as the user-visible
> impact.

Hi Chris,

This commit fixes two kinds of races, and they may have different
results:

Barry reported a BUG_ON in commit c50f8e6053b0; we may see the same
BUG_ON if the filemap lookup returns NULL and a folio is added to the
swap cache after that.

If the other kind of race is triggered (the folio changed after the
lookup), we may see the RSS counter get corrupted:

[  406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_ANONPAGES val:-1
[  406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_SHMEMPAGES val:1

That happens because the folio is being accounted to the wrong VMA.
I'm not sure if there will be any data corruption, though; it seems
not.

The issues above are critical already.
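
As a side note, the core invariant of the first fix can be restated as
a short pseudo-C sketch (simplified, with error paths trimmed; the real
check sits in move_swap_pte() in the patch quoted below, and the lookup
helpers here follow the current mm code):

	/*
	 * A lockless swap cache lookup may return a folio whose swap
	 * entry has since been freed and reused, so the folio's
	 * identity must be re-checked under the folio lock before it
	 * is moved.
	 */
	folio = filemap_get_folio(swap_address_space(entry),
				  swap_cache_index(entry));
	if (!IS_ERR_OR_NULL(folio)) {
		if (!folio_trylock(folio))
			return -EAGAIN;		/* contended, retry */
		/* Does this folio still own swap entry S1? */
		if (!folio_test_swapcache(folio) ||
		    folio->swap.val != entry.val) {
			folio_unlock(folio);
			folio_put(folio);
			return -EAGAIN;		/* entry was reused */
		}
	}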
>
> >
> > This can be fixed by checking if the folio returned by filemap is the
> > valid swap cache folio after acquiring the folio lock.
> >
> > Another similar race is possible: filemap_get_folio may return NULL, but
> > folio (A) could be swapped in and then swapped out again using the same
> > swap entry after the lookup. In such a case, folio (A) may remain in the
> > swap cache, so it must be moved too:
> >
> > CPU1                               CPU2
> > userfaultfd_move
> >   move_pages_pte()
> >     entry = pte_to_swp_entry(orig_src_pte);
> >     // Here it got entry = S1, and S1 is not in swap cache
> >     folio = filemap_get_folio(S1)
> >     // Got NULL
> >     ... < interrupted again> ...
> >                                    <swapin folio A and free S1>
> >                                    <swapout folio A using S1>
> >     move_swap_pte
> >       double_pt_lock
> >       is_pte_pages_stable
> >       // Check passed because src_pte == S1
> >       folio_move_anon_rmap(...)
> >       // folio A is ignored !!!
> >
> > Fix this by checking the swap cache again after acquiring the src_pte
> > lock; to avoid the filemap overhead, we check swap_map directly [2].
> >
> > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but so
> > far we don't need to worry about that, since folios can only be exposed
> > to the swap cache in the swap-out path, and this is covered in this
> > patch by checking the swap cache again after acquiring the src_pte lock.
> >
> > Testing with a simple C program that allocates and moves several GB of
> > memory did not show any observable performance change.
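
FWIW, such an allocate-and-move test is roughly of the following shape
(a minimal sketch, not the exact program used; it assumes a kernel with
UFFDIO_MOVE, i.e. 6.8+, and error handling is trimmed):

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define SIZE (1UL << 30) /* move 1 GiB per run; scale up as needed */

  int main(void)
  {
          /* UFFD_FEATURE_MOVE must be negotiated to use UFFDIO_MOVE */
          struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_MOVE };
          long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          char *src, *dst;

          if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
                  return 1;

          src = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          dst = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (src == MAP_FAILED || dst == MAP_FAILED)
                  return 1;

          /* the dst range must be registered (missing mode) for UFFDIO_MOVE */
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)dst, .len = SIZE },
                  .mode = UFFDIO_REGISTER_MODE_MISSING,
          };
          if (ioctl(uffd, UFFDIO_REGISTER, &reg))
                  return 1;

          memset(src, 1, SIZE); /* fault in exclusive anon pages */

          struct uffdio_move mv = {
                  .dst = (unsigned long)dst,
                  .src = (unsigned long)src,
                  .len = SIZE,
          };
          if (ioctl(uffd, UFFDIO_MOVE, &mv) || mv.move != (long)SIZE)
                  return 1;
          return 0;
  }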
> >
> > Cc: stable@vger.kernel.org
> > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/all/CAGsJ_4yJhJBo16XhiC-nUzSheyX-V3-nFE+tAi=8Y560K8eT=A@mail.gmail.com/ [2]
> > Signed-off-by: Kairui Song <ryncsn@gmail.com>
> > Reviewed-by: Lokesh Gidra
> >
> > ---
> >
> > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/
> > Changes:
> > - Check swap_map instead of doing a filemap lookup after acquiring the
> >   PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ]
> >
> > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/
> > Changes:
> > - Move the folio and swap check inside move_swap_pte to avoid skipping
> >   the check and potential overhead [ Lokesh Gidra ]
> > - Add a READ_ONCE for the swap_map read to ensure it reads an
> >   up-to-date value.
> >
> > V3: https://lore.kernel.org/all/20250602181419.20478-1-ryncsn@gmail.com/
> > Changes:
> > - Add more comments and more context in the commit message.
> >
> >  mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++--
> >  1 file changed, 31 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..8253978ee0fb 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                           pte_t orig_dst_pte, pte_t orig_src_pte,
> >                           pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                           spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > -                         struct folio *src_folio)
> > +                         struct folio *src_folio,
> > +                         struct swap_info_struct *si, swp_entry_t entry)
> >  {
> > +       /*
> > +        * Check if the folio still belongs to the target swap entry after
> > +        * acquiring the lock. Folio can be freed in the swap cache while
> > +        * not locked.
> > +        */
> > +       if (src_folio && unlikely(!folio_test_swapcache(src_folio) ||
> > +                                 entry.val != src_folio->swap.val))
> > +               return -EAGAIN;
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > @@ -1102,6 +1112,25 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               /*
> > +                * Check if the swap entry is cached after acquiring the src_pte
> > +                * lock. Otherwise, we might miss a newly loaded swap cache folio.
> > +                *
> > +                * Check swap_map directly to minimize overhead; READ_ONCE is
> > +                * sufficient. We are trying to catch newly added swap cache, and
> > +                * the only possible case is when a folio is swapped in and out
> > +                * again while staying in the swap cache, using the same entry,
> > +                * before the PTE check above. The PTL is acquired and released
> > +                * twice, each time after updating the swap_map's flag, so holding
> > +                * the PTL here ensures we see the updated value. A false positive
> > +                * is possible, e.g. SWP_SYNCHRONOUS_IO swapin may set the flag
> > +                * without touching the cache, or during the tiny synchronization
> > +                * window between swap cache and swap_map, but it will be gone
> > +                * very quickly, so the worst result is retry jitter.
> > +                */
> > +               if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
>
> Nit: You can use "} else if {" to save one level of indentation.
>
> Reviewed-by: Chris Li

Thanks!
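
With the nit applied, that branch would read roughly as follows (a
sketch only — the quoted hunk is truncated, so I'm assuming the branch
body drops the PTLs and returns -EAGAIN, matching the retry behavior
the comment describes):

        if (src_folio) {
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
        } else if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
                /*
                 * A folio was loaded into the swap cache after the
                 * earlier lookup; back out and let the caller retry.
                 */
                double_pt_unlock(dst_ptl, src_ptl);
                return -EAGAIN;
        }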