From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A289EC5AE59 for ; Sat, 31 May 2025 07:00:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 216C56B016C; Sat, 31 May 2025 03:00:54 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 19FE66B0170; Sat, 31 May 2025 03:00:54 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 069136B0171; Sat, 31 May 2025 03:00:54 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id DA9E66B016C for ; Sat, 31 May 2025 03:00:53 -0400 (EDT) Received: from smtpin04.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 7AC62141185 for ; Sat, 31 May 2025 07:00:53 +0000 (UTC) X-FDA: 83502305586.04.552DC69 Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com [209.85.208.169]) by imf22.hostedemail.com (Postfix) with ESMTP id 80013C000F for ; Sat, 31 May 2025 07:00:51 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=hjUqON9b; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf22.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748674851; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=O0/6i247A0yd7g/ohON1pSpOBinIfrLgbq10zNbzL4E=; b=nlSsMsNTq4/kJF9+ovf5vQJoZJ341ZFrjQnbNPVr4fqVVqV/jg+mrUjco7kPfavEyLwE7t CKTCp6GhxfHSX5yVltC2x7nUaXuaqgRz2m5n9v7GN/GhV1ghbbORdeoiJGPQw6DgXa2ZjF CFUgDJsu2JbwBnLmGFJje/PwDvAEgpY= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=hjUqON9b; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf22.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.169 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748674851; a=rsa-sha256; cv=none; b=uC0sM7Qwg4Z8IO3jRymOzbFoZN/3ebjJXP1LPmlnz5xSlLXLYZhiYU10GkeIdAnj/uTvY5 sd1G1aV8SOOAUghUuC0xGlTBm/NpWpkL62f+B8L+NC7ZAanqV5z3IIBn9VpAIuu1+m5ocg dbJcwVrC0fUkM2/s9gyx/+8RWXcvOOY= Received: by mail-lj1-f169.google.com with SMTP id 38308e7fff4ca-310447fe59aso27836911fa.0 for ; Sat, 31 May 2025 00:00:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748674850; x=1749279650; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=O0/6i247A0yd7g/ohON1pSpOBinIfrLgbq10zNbzL4E=; b=hjUqON9bqhG/isY9HyziFYlthVafVVxfYVWb5T/L4uCCEttZOUFBILvzdTk+G7PeoG UqDuDam4655WnBdL51K54u6zSdtDp25yonVGXbQcTBQ3PkPMuXA8e23DZNH5/vHiGj3i Hh8WfQJ+FJcEJ06ac8o6H+MwiEVYEnz5TJXpocP/VINXjhG/0wuwXWR3Z3CXH09pL4hg sWXMg41MP83FhSdweCUgin2kNI55huIPZ6t3/lVlkebUm6KKZjtDR3v5HvozBBCLCjcP ZAKL3uzid1HLg/JgpOpIrlJ2mX+IlXZvAuoxrWThqt9LROTsqkAOJDvMSrQgRxPu3ciA JJzw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748674850; x=1749279650; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=O0/6i247A0yd7g/ohON1pSpOBinIfrLgbq10zNbzL4E=; b=gXW+SijZvTrI+uOW4grN9FF22r7H6+4UN0+gasGw4qS53lofGOLT9P4wBH3vcBHKDO DJkrGBAx67/NXVpT7d2kpC2spSkn6Zkyz1dOecSdn+2qOa8lQ9HBV1CM9u0mKfKfjsqE EH1xGipRx77AsqyjyYF6gvxAkQ47/fF7Kd5BrIN1qQomKFXy0TRZAGOgD7JKkM+nFOT3 +FBwmW0a9kBlbN+dSRnoeYwIDil8XqldEZ2rKu+CicYP0dvicZQU98agrs/zH0UjH616 DbVF7pOw2aFLlgdMJ2tZvZH9dmROlNeirrS3lfnEECfzbBTY1pLSF1m/0GASeqdi/cMv ySqg== X-Forwarded-Encrypted: i=1; AJvYcCUo6c+XHlzaoCcm1XIR0XvZ6F4xb6zIEPuCcoNu/ao58G/pfhd3o2gVR/pbm69/2ILFRT6aeQnNfQ==@kvack.org X-Gm-Message-State: AOJu0Yy8jV6P1wovhph6iziy9+T4WjzrvSl+ATbE7qp1Q1xEtcaP4IOk GRRie+eB6p0q3JN9VJIex/YoatPsojdUeSZvHBax0Zk2blaBavy/YZQX49JrFw21kS9b7A5bZNQ AeE4Nl+zhmMK6PmHYtUszVUB6IGMhHrg= X-Gm-Gg: ASbGncv9HzCcpPzVas903ajCjSMl6s3dUUJmLC/LFiQgtqsTnDA87Qos3vaeARM0mrp 3vti6+wVL410HGc40ThIsLsfEAmH0r+URE26UU7WBszzK//0U+JO8uP9bepUM8e049JLWCyPC8K 6lllwWbRWV8HYneuAGhYlNAg06SEK1kszcJeqz+nBIhxs= X-Google-Smtp-Source: AGHT+IFQfYTOUKoUQsoKvjrVG8EjikBueS9mdC3qCctGVd586bN2S/4pCQTvJwmEnsD58/V2Q3nlHbLroDe2+EGI8A8= X-Received: by 2002:a05:651c:b1f:b0:32a:8bf4:3a54 with SMTP id 38308e7fff4ca-32a9e99b639mr3723201fa.2.1748674849519; Sat, 31 May 2025 00:00:49 -0700 (PDT) MIME-Version: 1.0 References: <20250530201710.81365-1-ryncsn@gmail.com> In-Reply-To: From: Kairui Song Date: Sat, 31 May 2025 15:00:31 +0800 X-Gm-Features: AX0GCFuLqGZrssa2ETQMnuocMxOiKaSYx_nnjXGgLT7jaHa6QNo6cy-ULN9-bYc Message-ID: Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache To: Barry Song <21cnbao@gmail.com> Cc: Lokesh Gidra , linux-mm@kvack.org, Andrew Morton , Peter Xu , Suren Baghdasaryan , Andrea Arcangeli , David Hildenbrand , stable@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 80013C000F X-Stat-Signature: joc1nd9o8ndwmgwc8aumysz8fnxapqrw X-Rspam-User: X-HE-Tag: 1748674851-184316 X-HE-Meta: U2FsdGVkX19TmbznGAQs5fd/oJrk5w3rOGdBc/JCtwUeqX89VI/MfmUhs5WnMwuQBoyFmOhJEEFQ5GM+NVC8Ge96xDSUbiWk9uuZGFTBzGHCAFvNenl9v5WChmLCuYZI1+i/SzVN4MHMm11il2rTQHSIicRYNwYvPT6kZEruKuZZeS4JQRyB4+uGG25zuHB7Pze3IBp+0vuiwQaTq9MTauYlKYpgHfgA3TI4xMaeynsq/+6RpHehvXlzUyhlE886oPS//+fcoa1TnuiZ7kdQho6+x99v+Z2rt/L3jLJgXeUICx6Zfwj/3VrAFZR/xYA8Tb31vCpJE1flBzSc5MRQe4lZ7kFJ3jpL3bocG1BSfYDJjlbu6O9WR6MM1SKUIMVi9wS4P492tyA1MxM2IhekgjB5C3BI3mFddBXj0pdXQ55H9lteoQ/3dcOkFNLkCEnSQpMq7UK9nsyNcxZ2lWWk1bR3ilrSp4/P4vLE3VqBHId8iyVmmCuTHPlREJACkq4BfdzpCsJxnMJAjDt1etukYmVgfwMS2RJJF8RQ6Nl/8IeuXBYE2p/cSHjlilJMXagkVbcz0R4y+5dKIQ5/7zvx1q63gSDp2lO9GgpGKiHOImGIPdGTzo895HyTx29gobe5yRsITzxMdwLvvhxTON+V3z3oFf3HfFrNiNPJpHXIQumVPZgvSsgsapQxHJBCNkul8Nkl2I33rLZjhhGJ1F38+6ItbmuWE0pNZCV4iGN1kC0KUar0Figt+mtrGGLm9l+BjlqFm68kfHjM5KOGKaIU+MjfJvNKGOJp5QQmRE2arNHYZkWVLMFrQxRdsWN43jI+3Sd1qDHPSfW97vaRXsh/bXy/jii1xGAiDDUFWtAaEXD0Z+MHQTg6WMdSmG4PV645948cYkUCEx8GtqUXyqunLxSCyBJAQSTuJ4nPpUlUAj4yPIJqm9gR7MFKiR2Ik1kthxArbBzhXOGuBpjKp7i 4LJNKMH7 F6wxihIY97ou+tVmlek+edxC0QPsoqAAvSVrGGu6dS644oMHcJiope419nWuJH/bT4JWLzvmYa6e0EZGmSSccsxb7wMO75LOzXUIVy+6NqD0KVBqpngsRupTDJShhbHx8VaUrU95GQKquXPnY1CuRKvdKFt1jowuYBca39Ryi1dY1f/FxCcpuO4hbEZfuKsH4VYVCYDdv7H+OGLwR3pA77LXxa1sd/ByrZNot8LBbrv5Lr2xsCWgPdXoC7/rF9MZ/6dKzb6+HfncImYrCdFJvKRfH/0aIb18eBcpK1CJ4Xmh9mCE2/42xIFfshokM+D/+XkK4S2RDuc3dik3LePw7NTbTrbJtR7KDVjNEgz1LzzFzIrm3XhHdxQL5ymHSWsx4Pz4VeGQ7LFXFsxj6F97OBQwuXBPc7CW/cHsf2cAd736o6OJhxEubKT2Dgl9Tc3HI0UZLOGqFH52GACE= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, May 31, 2025 at 2:36=E2=80=AFPM Barry Song <21cnbao@gmail.com> wrot= e: > > On Sat, May 31, 2025 at 6:25=E2=80=AFPM Kairui Song wr= ote: > > > > On Sat, May 31, 2025 at 11:39=E2=80=AFAM Barry Song <21cnbao@gmail.com>= wrote: > > > > > > On Sat, May 31, 2025 at 11:40=E2=80=AFAM Lokesh Gidra wrote: > > > > > > > > On Fri, May 30, 2025 at 1:17=E2=80=AFPM Kairui Song wrote: > > > > > > > > > > From: Kairui Song > > > > > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap= cache > > > > > lookup, and try to move the found folio to the faulting vma when. > > > > > Currently, it relies on the PTE value check to ensure the moved f= olio > > > > > still belongs to the src swap entry, which turns out is not relia= ble. > > > > > > > > > > While working and reviewing the swap table series with Barry, fol= lowing > > > > > existing race is observed and reproduced [1]: > > > > > > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a > > > > > swap entry PTE holding swap entry S1, and S1 isn't in the swap c= ache.) > > > > > > > > > > CPU1 CPU2 > > > > > userfaultfd_move > > > > > move_pages_pte() > > > > > entry =3D pte_to_swp_entry(orig_src_pte); > > > > > // Here it got entry =3D S1 > > > > > ... < Somehow interrupted> ... > > > > > > > > > > // folio A is just a new alloc= ated folio > > > > > // and get installed into src_= pte > > > > > > > > > > // src_pte now points to folio= A, S1 > > > > > // has swap count =3D=3D 0, it= can be freed > > > > > // by folio_swap_swap or swap > > > > > // allocator's reclaim. > > > > > > > > > > // folio B is a folio in anoth= er VMA. > > > > > > > > > > // S1 is freed, folio B could = use it > > > > > // for swap out with no proble= m. > > > > > ... > > > > > folio =3D filemap_get_folio(S1) > > > > > // Got folio B here !!! > > > > > ... < Somehow interrupted again> ... > > > > > > > > > > // Now S1 is free to be used a= gain. > > > > > > > > > > // Now src_pte is a swap entry= pte > > > > > // holding S1 again. > > > > > folio_trylock(folio) > > > > > move_swap_pte > > > > > double_pt_lock > > > > > is_pte_pages_stable > > > > > // Check passed because src_pte =3D=3D S1 > > > > > folio_move_anon_rmap(...) > > > > > // Moved invalid folio B here !!! > > > > > > > > > > The race window is very short and requires multiple collisions of > > > > > multiple rare events, so it's very unlikely to happen, but with a > > > > > deliberately constructed reproducer and increased time window, it= can be > > > > > reproduced [1]. > > > > > > > > Thanks for catching and fixing this. Just to clarify a few things > > > > about your reproducer: > > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or > > > > MAP_PRIVATE will work as well? > > > > 2. You mentioned that the 'current dir is on a block device'. Are y= ou > > > > indicating that if we are using zram for swap then it doesn't > > > > reproduce? > > > > > > > > > > > > > > It's also possible that folio (A) is swapped in, and swapped out = again > > > > > after the filemap_get_folio lookup, in such case folio (A) may st= ay in > > > > > swap cache so it needs to be moved too. In this case we should al= so try > > > > > again so kernel won't miss a folio move. > > > > > > > > > > Fix this by checking if the folio is the valid swap cache folio a= fter > > > > > acquiring the folio lock, and checking the swap cache again after > > > > > acquiring the src_pte lock. > > > > > > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but s= o far > > > > > we don't need to worry about that since folios only might get exp= osed to > > > > > swap cache in the swap out path, and it's covered in this patch t= oo by > > > > > checking the swap cache again after acquiring src_pte lock. > > > > > > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") > > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=3D6OOrK2OUZ0-= tqCzi+EJt+2_K97TPGoSt=3D9+JwP7Q@mail.gmail.com/ [1] > > > > > Signed-off-by: Kairui Song > > > > > --- > > > > > mm/userfaultfd.c | 26 ++++++++++++++++++++++++++ > > > > > 1 file changed, 26 insertions(+) > > > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > > > index bc473ad21202..a1564d205dfb 100644 > > > > > --- a/mm/userfaultfd.c > > > > > +++ b/mm/userfaultfd.c > > > > > @@ -15,6 +15,7 @@ > > > > > #include > > > > > #include > > > > > #include > > > > > +#include > > > > I guess you mistakenly left it from your reproducer code :) > > > > > #include > > > > > #include > > > > > #include "internal.h" > > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *= mm, struct vm_area_struct *dst_vma, > > > > > spinlock_t *dst_ptl, spinlock_t *src_ptl= , > > > > > struct folio *src_folio) > > > > > { > > > > > + swp_entry_t entry; > > > > > + > > > > > double_pt_lock(dst_ptl, src_ptl); > > > > > > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, = orig_src_pte, > > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct = *mm, struct vm_area_struct *dst_vma, > > > > > if (src_folio) { > > > > > folio_move_anon_rmap(src_folio, dst_vma); > > > > > src_folio->index =3D linear_page_index(dst_vma, d= st_addr); > > > > > + } else { > > > > > + /* > > > > > + * Check again after acquiring the src_pte lock. = Or we might > > > > > + * miss a new loaded swap cache folio. > > > > > + */ > > > > > + entry =3D pte_to_swp_entry(orig_src_pte); > > > > > + src_folio =3D filemap_get_folio(swap_address_spac= e(entry), > > > > > + swap_cache_index(en= try)); > > > > > > > > Given the non-trivial overhead of filemap_get_folio(), do you think= it > > > > will work if filemap_get_filio() was only once after locking src_pt= l? > > > > Please correct me if my assumption about the overhead is wrong. > > > > > > not quite sure as we have a folio_lock(src_folio) before move_swap_pt= e(). > > > can we safely folio_move_anon_rmap + src_folio->index while not holdi= ng > > > folio lock? > > > > I think no, we can't even make sure the folio is still in the swap > > cache, so it can be a freed folio that does not belong to any VMA > > while not holding the folio lock. > > Right, but will the following be sufficient, given that we don=E2=80=99t = really > care about the folio=E2=80=94only whether there=E2=80=99s new cache? > > if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) { > double_pt_unlock(dst_ptl, src_ptl); > return -EAGAIN; > } The problem is reading swap_map without locking the cluster map seems unstable, and has strange false positives, a swapin will set this bit first, while not adding the folio to swap cache or even when skipping the swap cache, that seems could make it more complex.