From mboxrd@z Thu Jan  1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Sat, 31 May 2025 19:11:03 +1200
Subject: Re: [PATCH] mm: userfaultfd: fix race of userfaultfd_move and swap cache
To: Kairui Song
Cc: Lokesh Gidra, linux-mm@kvack.org, Andrew Morton, Peter Xu,
 Suren Baghdasaryan, Andrea Arcangeli, David Hildenbrand,
 stable@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250530201710.81365-1-ryncsn@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
On Sat, May 31, 2025 at 7:06 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 31, 2025 at 7:00 PM Kairui Song wrote:
> >
> > On Sat, May 31, 2025 at 2:36 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 31, 2025 at 6:25 PM Kairui Song wrote:
> > > >
> > > > On Sat, May 31, 2025 at 11:39 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, May 31, 2025 at 11:40 AM Lokesh Gidra wrote:
> > > > > >
> > > > > > On Fri, May 30, 2025 at 1:17 PM Kairui Song wrote:
> > > > > > >
> > > > > > > From: Kairui Song
> > > > > > >
> > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap
> > > > > > > cache lookup and tries to move the found folio to the faulting VMA.
> > > > > > > Currently, it relies on a PTE value check to ensure the moved folio
> > > > > > > still belongs to the src swap entry, which turns out to be unreliable.
> > > > > > >
> > > > > > > While working on and reviewing the swap table series with Barry, the
> > > > > > > following existing race was observed and reproduced [1]:
> > > > > > >
> > > > > > > (move_pages_pte is moving src_pte to dst_pte, where src_pte is a
> > > > > > >  swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.)
> > > > > > >
> > > > > > > CPU1                              CPU2
> > > > > > > userfaultfd_move
> > > > > > >  move_pages_pte()
> > > > > > >   entry = pte_to_swp_entry(orig_src_pte);
> > > > > > >   // Here it got entry = S1
> > > > > > >   ... < Somehow interrupted> ...
> > > > > > >                                   // folio A is just a newly allocated
> > > > > > >                                   // folio and gets installed into
> > > > > > >                                   // src_pte by a swapin.
> > > > > > >                                   // src_pte now points to folio A, S1
> > > > > > >                                   // has swap count == 0, it can be freed
> > > > > > >                                   // by folio_free_swap or the swap
> > > > > > >                                   // allocator's reclaim.
> > > > > > >                                   // folio B is a folio in another VMA.
> > > > > > >                                   // S1 is freed, folio B could use it
> > > > > > >                                   // for swap out with no problem.
> > > > > > >                                   ...
> > > > > > >   folio = filemap_get_folio(S1)
> > > > > > >   // Got folio B here !!!
> > > > > > >   ... < Somehow interrupted again> ...
> > > > > > >                                   // Now S1 is free to be used again.
> > > > > > >                                   // Now src_pte is a swap entry PTE
> > > > > > >                                   // holding S1 again.
> > > > > > >   folio_trylock(folio)
> > > > > > >   move_swap_pte
> > > > > > >    double_pt_lock
> > > > > > >    is_pte_pages_stable
> > > > > > >    // Check passed because src_pte == S1
> > > > > > >    folio_move_anon_rmap(...)
> > > > > > >    // Moved invalid folio B here !!!
> > > > > > >
> > > > > > > The race window is very short and requires multiple collisions of
> > > > > > > multiple rare events, so it's very unlikely to happen, but with a
> > > > > > > deliberately constructed reproducer and an increased time window, it
> > > > > > > can be reproduced [1].
> > > > > >
> > > > > > Thanks for catching and fixing this. Just to clarify a few things
> > > > > > about your reproducer:
> > > > > > 1. Is it necessary for the 'race' mapping to be MAP_SHARED, or
> > > > > > will MAP_PRIVATE work as well?
> > > > > > 2. You mentioned that the 'current dir is on a block device'. Are you
> > > > > > indicating that if we are using zram for swap then it doesn't
> > > > > > reproduce?
> > > > > >
> > > > > > >
> > > > > > > It's also possible that folio (A) is swapped in, and swapped out
> > > > > > > again after the filemap_get_folio lookup; in such a case folio (A)
> > > > > > > may stay in the swap cache, so it needs to be moved too. In this
> > > > > > > case we should also try again so the kernel won't miss a folio move.
> > > > > > >
> > > > > > > Fix this by checking if the folio is the valid swap cache folio
> > > > > > > after acquiring the folio lock, and checking the swap cache again
> > > > > > > after acquiring the src_pte lock.
> > > > > > >
> > > > > > > The SWP_SYNCHRONOUS_IO path does make the problem more complex, but
> > > > > > > so far we don't need to worry about that, since folios might only get
> > > > > > > exposed to the swap cache in the swap-out path, and that is covered
> > > > > > > by this patch too, by checking the swap cache again after acquiring
> > > > > > > the src_pte lock.
> > > > > > >
> > > > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1]
> > > > > > > Signed-off-by: Kairui Song
> > > > > > > ---
> > > > > > >  mm/userfaultfd.c | 26 ++++++++++++++++++++++++++
> > > > > > >  1 file changed, 26 insertions(+)
> > > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index bc473ad21202..a1564d205dfb 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -15,6 +15,7 @@
> > > > > > >  #include <...>
> > > > > > >  #include <...>
> > > > > > >  #include <...>
> > > > > > > +#include <...>
> > > > > >
> > > > > > I guess you mistakenly left it from your reproducer code :)
> > > > > >
> > > > > > >  #include <...>
> > > > > > >  #include <...>
> > > > > > >  #include "internal.h"
> > > > > > > @@ -1086,6 +1087,8 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > > > > > >                          struct folio *src_folio)
> > > > > > >  {
> > > > > > > +       swp_entry_t entry;
> > > > > > > +
> > > > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > > > >
> > > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > @@ -1102,6 +1105,19 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> > > > > > >         if (src_folio) {
> > > > > > >                 folio_move_anon_rmap(src_folio, dst_vma);
> > > > > > >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > +       } else {
> > > > > > > +               /*
> > > > > > > +                * Check again after acquiring the src_pte lock. Or we might
> > > > > > > +                * miss a new loaded swap cache folio.
> > > > > > > +                */
> > > > > > > +               entry = pte_to_swp_entry(orig_src_pte);
> > > > > > > +               src_folio = filemap_get_folio(swap_address_space(entry),
> > > > > > > +                                             swap_cache_index(entry));
> > > > > >
> > > > > > Given the non-trivial overhead of filemap_get_folio(), do you think it
> > > > > > will work if filemap_get_folio() was only called once, after locking
> > > > > > src_ptl? Please correct me if my assumption about the overhead is wrong.
> > > > >
> > > > > Not quite sure, as we have a folio_lock(src_folio) before move_swap_pte().
> > > > > Can we safely do folio_move_anon_rmap + update src_folio->index while
> > > > > not holding the folio lock?
> > > >
> > > > I think no. We can't even make sure the folio is still in the swap
> > > > cache; while not holding the folio lock, it could be a freed folio
> > > > that does not belong to any VMA.
> > >
> > > Right, but will the following be sufficient, given that we don't really
> > > care about the folio, only whether there's a new swap cache entry?
> > >
> > > if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
> > >         double_pt_unlock(dst_ptl, src_ptl);
> > >         return -EAGAIN;
> > > }
> >
> > The problem is that reading swap_map without locking the cluster seems
> > unstable and has strange false positives: a swapin will set this bit
> > first, while not (or not yet) adding the folio to the swap cache, or
> > even when skipping the swap cache entirely. That could make it more
> > complex.
>
> As long as it's a false positive and not a false negative, I think it's
> acceptable, especially if we're concerned about the overhead of
> filemap_get_folio. The probability is extremely low (practically close
> to 0%), yet we still need to call filemap_get_folio for every swap PTE.
By the way, if we do want to eliminate the false positives, we could do
the following:

if (READ_ONCE(si->swap_map[offset]) & SWAP_HAS_CACHE) {
        folio = filemap_get_folio(...);
        if (folio....) {
                double_pt_unlock(dst_ptl, src_ptl);
                return -EAGAIN;
        }
}

This might eliminate 99.999999...% of filemap_get_folio calls, though my
gut feeling is that it might not be necessary :-)

Thanks
Barry