From: Barry Song <21cnbao@gmail.com>
Date: Fri, 23 May 2025 14:29:54 +1200
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
To: Kairui Song
Cc: akpm@linux-foundation.org, Baolin Wang, Baoquan He, Chris Li,
 David Hildenbrand, Johannes Weiner, Hugh Dickins, Kalesh Singh,
 LKML, linux-mm, Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen,
 Matthew Wilcox, "Huang, Ying", Yosry Ahmed
References: <20250514201729.48420-6-ryncsn@gmail.com>
 <20250519043847.1806-1-21cnbao@gmail.com>
On Wed, May 21, 2025 at 2:45 PM Kairui Song wrote:
>
> Barry Song <21cnbao@gmail.com> wrote on Wednesday, 21 May 2025 at 06:33:
> >
> > On Wed, May 21, 2025 at 7:10 AM Kairui Song wrote:
> > >
> > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > > From: Kairui Song
> > > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > >                                 goto retry;
> > > > > > >                         }
> > > > > > >                 }
> > > > > > > +               if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > +                       err = -EBUSY;
> > > > > > > +                       goto out;
> > > > > > > +               }
> > > > > >
> > > > > > It seems we don't need this. In move_swap_pte(), we have been
> > > > > > checking that the pte pages are stable:
> > > > > >
> > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > >                 return -EAGAIN;
> > > > > >         }
> > > > >
> > > > > The tricky part is that when swap_cache_get_folio() returns the
> > > > > folio, both the folio and the ptes are unlocked. So is it possible
> > > > > that someone else swapped in the entries, then swapped them out
> > > > > again using the same entries?
> > > > >
> > > > > The folio will be different here, but the PTEs still hold the same
> > > > > value, so they will pass the is_pte_pages_stable check; we
> > > > > previously saw similar races with anon fault or shmem. I think more
> > > > > strict checking won't hurt here.
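[Note: a minimal sketch of what such a strict check could look like.
folio_swap_contains() is the helper named by this series; the body below is
illustrative only, assuming the swap cache convention that folio->swap holds
the folio's first swap entry, and is not taken from the patch itself:]

        /* Sketch: does this swap cache folio still cover @entry? */
        static inline bool folio_swap_contains(struct folio *folio,
                                               swp_entry_t entry)
        {
                unsigned long first = swp_offset(folio->swap);

                if (swp_type(folio->swap) != swp_type(entry))
                        return false;
                /* Large folios cover a contiguous range of swap offsets. */
                return swp_offset(entry) >= first &&
                       swp_offset(entry) < first + folio_nr_pages(folio);
        }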
> > > >
> > > > This doesn't seem to be the same case as the one you fixed in
> > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > case, no one was hitting the swap cache, and you used
> > > > swapcache_prepare() to set up the cache to fix the issue.
> > > >
> > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > not guard against that case either.
> > >
> > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > above. Thanks for catching this!
> > >
> > > > But I suspect we won't have a problem, since we're not swapping in;
> > > > we didn't read any stale data, right? Swap-in will only occur after
> > > > we move the PTEs.
> > >
> > > My concern is that a parallel swapin / swapout could result in the
> > > folio being a completely irrelevant or invalid folio.
> > >
> > > It's not about the dst, but on the move src side, something like:
> > >
> > > CPU1                                     CPU2
> > > move_pages_pte
> > >   folio = swap_cache_get_folio(...)
> > >   | Got folio A here
> > > move_swap_pte
> > >                                          <parallel swapin / swapout>
> > >   | Now folio A is no longer valid.
> > >   | It's very unlikely, but here SWAP
> > >   | could reuse the same entry as above.
> >
> > swap_cache_get_folio() does increment the folio's refcount, but it
> > seems this doesn't prevent do_swap_page() from freeing the swap entry
> > after swapping in src_pte with folio A, if it's a read fault.
> >
> > For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > will be false:
> >
> > static inline bool should_try_to_free_swap(struct folio *folio,
> >                                            struct vm_area_struct *vma,
> >                                            unsigned int fault_flags)
> > {
> >         ...
> >
> >         /*
> >          * If we want to map a page that's in the swapcache writable, we
> >          * have to detect via the refcount if we're really the exclusive
> >          * user. Try freeing the swapcache to get rid of the swapcache
> >          * reference only in case it's likely that we'll be the exclusive user.
> >          */
> >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > }
> >
> > And for swapout, __remove_mapping() does check the refcount as well:
> >
> > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> >         ...
> >         refcount = 1 + folio_nr_pages(folio);
> >         if (!folio_ref_freeze(folio, refcount))
> >                 goto cannot_free;
> >         ...
> > }
> >
> > However, since __remove_mapping() occurs after pageout(), it seems this
> > also doesn't prevent swapout from allocating a new swap entry to fill
> > src_pte.
> >
> > It seems your concern is valid, unless I'm missing something.
> > Do you have a reproducer? If so, this will likely need a separate fix
> > patch rather than being hidden in this patchset.
>
> Thanks for the analysis. I don't have a reproducer yet. I did some
> local experiments and it seems possible, but the race window is so tiny
> that it's very difficult to make swap entry reuse collide with it. I'll
> try more, but in theory this seems possible, or at least looks very
> fragile.
>
> And yeah, let's patch the kernel first if that's a real issue.
>
> > > double_pt_lock
> > > is_pte_pages_stable
> > >   | Passed because of entry reuse.
> > > folio_move_anon_rmap(...)
> > >   | Moved invalid folio A.
> > >
> > > And could it be possible that swap_cache_get_folio() returns NULL
> > > here, but later, right before the double_pt_lock, a folio is added to
> > > the swap cache? Maybe we'd better check the swap cache after clearing
> > > and releasing the dst lock, but before releasing the src lock?
> >
> > It seems you're suggesting that a parallel swap-in allocates and adds
> > a folio to the swap cache, but the PTE has not yet been updated from
> > a swap entry to a present mapping?
> >
> > As long as do_swap_page() adds the folio to the swap cache
> > before updating the PTE to present, this scenario seems possible.
>
> Yes, those are two kinds of problems here. I suspected there could be
> an ABA problem while working on the series, but wasn't certain. And I
> just realised there could be another missed cache read here, thanks to
> your review and discussion :)
>
> > It seems we need to double-check:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..976053bd2bf1 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               struct folio *folio =
> > +                       filemap_get_folio(swap_address_space(entry),
> > +                                         swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(folio)) {
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       return -EAGAIN;
> > +               }
> >         }
> > -
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
>
> Maybe it has to get even dirtier here and call swapcache_prepare() too,
> to cover the SYNC_IO case?
>
> > Let me run test case [1] to check whether this ever happens. I guess
> > I need to hack the kernel a bit to always add the folio to the
> > swapcache, even for SYNC IO.
>
> That will cause quite a performance regression, I think. The good thing
> is, that's exactly the problem this series is solving by dropping the
> SYNC IO swapin path and never bypassing the swap cache, while improving
> performance and eliminating things like this. One more reason to
> justify the approach :)

I attempted to reproduce the scenario where a folio is added to the
swapcache after filemap_get_folio() returns NULL but before
move_swap_pte() moves the swap PTE, using non-synchronized I/O.
Technically, this seems possible; however, I was unable to reproduce it,
likely because the time window between swapin_readahead() and taking the
page table lock within do_swap_page() is too short.

Upon reconsideration, even if this situation occurs, it is not an issue,
because move_swap_pte() takes both the source and destination page table
locks and *clears* the source PTE. Thus, when do_swap_page() subsequently
acquires the source page table lock, it cannot map the new swapcache
folio into the PTE, since pte_same() will return false.
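[For illustration: the recheck referred to above is the standard pte_same()
pattern in do_swap_page() in mm/memory.c, shown here in simplified form
with error handling elided:]

        /*
         * do_swap_page() re-takes the PTL and re-checks the PTE before
         * mapping the swapcache folio. Once move_swap_pte() has cleared
         * the source PTE under the same lock, pte_same() fails and the
         * fault backs out without mapping the folio.
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                       vmf->address, &vmf->ptl);
        if (unlikely(!vmf->pte ||
                     !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;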
And just > realised there could be another missed cache read here thanks to your > review and discussion :) > > > > > It seems we need to double-check: > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > index bc473ad21202..976053bd2bf1 100644 > > --- a/mm/userfaultfd.c > > +++ b/mm/userfaultfd.c > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm, > > struct vm_area_struct *dst_vma, > > if (src_folio) { > > folio_move_anon_rmap(src_folio, dst_vma); > > src_folio->index =3D linear_page_index(dst_vma, dst_add= r); > > + } else { > > + struct folio *folio =3D > > filemap_get_folio(swap_address_space(entry), > > + swap_cache_index(entry)); > > + if (!IS_ERR_OR_NULL(folio)) { > > + double_pt_unlock(dst_ptl, src_ptl); > > + return -EAGAIN; > > + } > > } > > - > > orig_src_pte =3D ptep_get_and_clear(mm, src_addr, src_pte); > > #ifdef CONFIG_MEM_SOFT_DIRTY > > orig_src_pte =3D pte_swp_mksoft_dirty(orig_src_pte); > > Maybe it has to get even dirtier here to call swapcache_prepare too to > cover the SYNC_IO case? > > > > > Let me run test case [1] to check whether this ever happens. I guess I = need to > > hack kernel a bit to always add folio to swapcache even for SYNC IO. > > That will cause quite a performance regression I think. Good thing is, > that's exactly the problem this series is solving by dropping the SYNC > IO swapin path and never bypassing the swap cache, while improving the > performance, eliminating things like this. One more reason to justify > the approach :) I attempted to reproduce the scenario where a folio is added to the swapcac= he after filemap_get_folio() returns NULL but before move_swap_pte() moves the swap PTE using non-synchronized I/O. Technically, this seems possible; however, I was unable to reproduce it, likely because the time window between swapin_readahead an= d taking the page table lock within do_swap_page() is too short. Upon reconsideration, even if this situation occurs, it is not an issue bec= ause move_swap_pte() obtains both the source and destination page table locks, and *clears* the source PTE. Thus, when do_swap_page() subsequently acquire= s the source page table lock for src, it cannot map the new swapcache folio to the PTE since pte_same will return false. > > > > > [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gma= il.com/ > > I'll try this too. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Also, -EBUSY is somehow incorrect error code. > > > > > > > > > > Yes, thanks, I'll use EAGAIN here just like move_swap_pte. > > > > > > > > > > > > > > > > > > > > > > > err =3D move_swap_pte(mm, dst_vma, dst_addr, sr= c_addr, dst_pte, src_pte, > > > > > > > orig_dst_pte, orig_src_pte, dst= _pmd, dst_pmdval, > > > > > > > dst_ptl, src_ptl, src_folio); > > > > > > > > > > > > > > > > > > > Thanks Barry