From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 90427C5AE59 for ; Tue, 3 Jun 2025 16:42:11 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id E6FCD6B04BF; Tue, 3 Jun 2025 12:42:10 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id E204E6B04C0; Tue, 3 Jun 2025 12:42:10 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D383F6B04C1; Tue, 3 Jun 2025 12:42:10 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id AE0BA6B04BF for ; Tue, 3 Jun 2025 12:42:10 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 72C7516100D for ; Tue, 3 Jun 2025 16:42:10 +0000 (UTC) X-FDA: 83514656820.03.BBAB3FB Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf28.hostedemail.com (Postfix) with ESMTP id E4BA3C0008 for ; Tue, 3 Jun 2025 16:42:07 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=CTGCW58v; spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748968928; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=XW5iA+IcoMvqOf0ndcrgPMgu4TWbTOSU4Ckyq0Vi+RE=; b=uLnVSC0Mn6SlYH+X4ClQUZC5LwZeT0UjQ2jUddIdUTZsYdNiwTcr3AH7i0VFwcVH8CuRxU mb/I3/Zfyp4XWeOVL/Ws+CMGU6hw4v1iMab50gkWn76pr7kKrAoUFGn5EWJmLYNSpHTbLi +N3kGjnHLfaRSO0dgymc39aeJQtghpc= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748968928; a=rsa-sha256; cv=none; b=hp8bGSBLjBOmu/rl8ATK1N30iAV5hdv/kXLgLbGGoU5fB+cdtTEy4FGAXYW5u37+0l4s4Z st2oMXuC8F5lXC/0AnIF52yM4yB0aij6DZ4Jb8gvbYEF6i2k1bsqyefHulsYOxuhjEUgqV A+UDjIlGIrXD7Oace+Z1q17GnEaNGqQ= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=CTGCW58v; spf=pass (imf28.hostedemail.com: domain of peterx@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=peterx@redhat.com; dmarc=pass (policy=quarantine) header.from=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1748968927; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=XW5iA+IcoMvqOf0ndcrgPMgu4TWbTOSU4Ckyq0Vi+RE=; b=CTGCW58vP4drg/zcLkrf/GXhq0MbaxtDA4fNHKJObZ5G6CaxvUVb1IBEp0WxjrTzP0Z+Tx zeCAW9eauvev8oog7QkUFOGFz9bJMxekAs+p+KcF8n7keKZYeSnLypIUV6JUaJlXMSuJDz 0ksjr+m67QQU5G+mm3zX+RIrNpLLiPc= Received: from mail-qk1-f198.google.com (mail-qk1-f198.google.com [209.85.222.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-587-K_7AEqg_MKCk3rrLyKlZCg-1; Tue, 03 Jun 2025 12:42:06 -0400 X-MC-Unique: K_7AEqg_MKCk3rrLyKlZCg-1 X-Mimecast-MFC-AGG-ID: K_7AEqg_MKCk3rrLyKlZCg_1748968925 Received: by mail-qk1-f198.google.com with SMTP id af79cd13be357-7c955be751aso927816185a.2 for ; Tue, 03 Jun 2025 09:42:06 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748968925; x=1749573725; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XW5iA+IcoMvqOf0ndcrgPMgu4TWbTOSU4Ckyq0Vi+RE=; b=XY4AOOMfi4a/tBfFLtuYJDf77FfvfdKi7NQyfFt/+KcYNgBBaxOhF/OIYNBJp5sT7h 7n/A1WleqkdZjlg21JTcXGGQfBpjyQYgm1P8Yt3ks0Yva/1ejv+Y84cQmqWxrLE8Y4XS N4pPG8Y0dONCNNc6TJg3IJ+YNg4TN5n9+T8jZoVlM8ANjR+jaO5SaUcrBibpD4hC1gsT Gwypz3/MXPhp8E/xSXRvnmUK1b2/kohgqzxv5ODEr2vcktVU89fnT6HpBb2MN1qavWiI lU79Bbwq32laqI/cVSzlGDG66fPv512ZFF1zFANdrqKxFWk3+gU/P3YrzUJqQ5CB6n7V Z3iQ== X-Forwarded-Encrypted: i=1; AJvYcCVuNGtOamJXiLhEjY3yIqOKLgHWz5jwN0rthYn/1stuV89Jp53+VD1na3DWvGBrlSPvlLnICJM9pg==@kvack.org X-Gm-Message-State: AOJu0Yzesg74qethBZL2Yxrb25ydM/pk+xhuUaEGLINpiPBXqV08noAc oC+etkhVSHN9EkVcA+Fd663zi1lDZ50iDw8QoRjrJGJFco6uOpqW6lpyjnUSo1g1qfGBJMB7Hf/ I8ZRVHKDpPTAtZoZNiHCiAh7s3V7j49vPx4yYXxeSHHr4ZcxSM3ML X-Gm-Gg: ASbGnctINqLRpWnZYi9O27U8u7Jd/nFsuSao34cqCJQBGdOn7V7Cp2+az/EdEsrekPY vhvT/P656+RT/GL08u7x0KQyl/rtpFFGImsMlS56tYuEkABpLg9Z/jwptV2oBhXWm4tKA9BBhIW f23WVCCov2jFUAay2oHPmur0cHzl5e5ce2V/tRwFaMNCMId5uzvdrR9Plqv+opnetrkPAdnVCL9 mXGua1z8iLhUcUGgMRMXN2QHWyA7GpS/JTVzOD/XFHWFOjEeOUzFHsO4z2leA6HheZUdm6ygYZ/ 4ng= X-Received: by 2002:a05:620a:190c:b0:7d0:9da7:6941 with SMTP id af79cd13be357-7d0a201df56mr2833923185a.37.1748968924976; Tue, 03 Jun 2025 09:42:04 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHNlgjeVsUhFeMqMzPsr8hOvznmRm5eUNpcsj+yPaaFc3gu0L6wkGYTMZrQGhj5zHFGM43GxA== X-Received: by 2002:a05:620a:190c:b0:7d0:9da7:6941 with SMTP id af79cd13be357-7d0a201df56mr2833919985a.37.1748968924554; Tue, 03 Jun 2025 09:42:04 -0700 (PDT) Received: from x1.local ([85.131.185.92]) by smtp.gmail.com with ESMTPSA id 6a1803df08f44-6fac6d33901sm82985226d6.18.2025.06.03.09.42.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 03 Jun 2025 09:42:04 -0700 (PDT) Date: Tue, 3 Jun 2025 12:42:01 -0400 From: Peter Xu To: Kairui Song Cc: Barry Song <21cnbao@gmail.com>, linux-mm@kvack.org, Andrew Morton , Suren Baghdasaryan , Andrea Arcangeli , David Hildenbrand , Lokesh Gidra , stable@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3] mm: userfaultfd: fix race of userfaultfd_move and swap cache Message-ID: References: <20250602181419.20478-1-ryncsn@gmail.com> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: lDaLixBALp6ChAfoZyx6a2pXRLnnMXtijJ2Gjicdry4_1748968925 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: E4BA3C0008 X-Stat-Signature: m7u5nj6iprat43s57xm8116k7wxbppcj X-Rspam-User: X-HE-Tag: 1748968927-33371 X-HE-Meta: U2FsdGVkX1+Mk2ZxRAsuxwhxtsxjHu9ib3IP2erez2yEpmwCnKCA/wD4YcvaJQVVGNdkUoY9qA8zi5y/MvNHzWuhQjlxNMCrv+ip1YlofgFEoDbEmC3nSuL4f7Tn5JQ8R0tlmWHZTXgnEd6fB4XFhAzZaFR4C00dZS0YeJQpFlDzseEPm7JGLpLYex3JboJfw4AyFgu4wPYkV1xBpd84jj2y2qylqjbQI4HmP9+dFMdD9nOO+sAGEagjfZrqUqhDZTo3O/qMyIBnGxjhYf5iZQUONU2tI6lRCOF+i8PXX2RGOk+BhFO8s5V9qqQQntRNDVlV5ww1Mh5RuDEG5jj52bmlXWrdcCMRXx+ESO1rNIPjjmt5imOUVk8XVfALtkiSvhk2Zq4wdWUWpC37bRfXWiyhStk39paB+oK0c7M71y8q81gI9AYxVubKZOY9xeEOu+oQ59ocMsF8dt3RrFv4ZnHYkj91J0FA3QHXp5D1KI+RKZ8E2r2awEvbhxZlGDIwkndGmqZtntIDuVm9L+wf3tXFusFDpV2C5zWOegor+6sBIAtCBVtw7JZJQLjh8V0aME8IOWrCUhRO9lidPJbl0LWMwCoDgRO1mRPGUzXbtptddBqfoVxVhBe/2N6OxNX3gn6A22QmI1WzRLC09FUP0F1mw27vVZAwA35iO0rxQMOUTHeVxxM9kqC2mdkZkPpPwOULrZFf5YwbcWpqXNlNSCyRTMF/0xL7AulR0HcrDVddwzc906tO8sPtYEOv8Y2dtCGTZRsgony3uMnSZ68a0f4PZ5x5wqopYUkY6E13muFIjoinHGwd2fvVkYo8OvptEUz30zBw9EHOy0IZqBQ/m8+ctV7WGcpc+rs+iD0Eb1twZUa1FUiBEEqoXhESUIgf89KBh2a/L1SXsDhMO+uL3JklwQRNdBS5Z2bXt0DXRN9liXzwf2PkJFxQaLJBu0aqQPkWCsTvB4swdkFOYXd nOWngXSe 7ezGcbvggoo6aXFx3s9uT9Eszb+qg3wN+DMLuA/xpfa84XUgASqNWabW7Bw7riIO6u4Qp3IGqAVCnveQDBzgcVjBnBStWq4hyKLM9ibEGbVjpZajng+BCzvIsDeNMxsqdN6JbQ7N7DEhp6sJTPRml9aE4fJ4XK1gvXcLSjemVDsBXN0SUAhW1Rbc6lAtU4XzUBWBw4sbv3dzrQkLopTpJU7X+/QjCCsrx8R48uArwu80Vcv2zBIP2TeeSwxvD2PEU9/FVYs5xK0XvetXm4rJ8ZQ94rpbVie4QDU+51edJXO66d90kbcvaPddHzvanVkUVK3iNcvYCxar5Glm1qkor1Y5/G6lcW6dyG3WNY87kMeyLzKQvFWMsszwr0cZASf8oqsgEFsDlxUSudD6T1KoHSFViBTks2Vxm9nslM32qbAo4byXvroOoewfQxGhyAj/svfzYTi4Z1zGiHlZT3RgNMCDFGJ1UfqwI8ZiPbfn2g+ZNp4cvviLlYrBCu+76qYcxhT0C X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jun 03, 2025 at 07:48:49PM +0800, Kairui Song wrote: > On Tue, Jun 3, 2025 at 6:08 AM Barry Song <21cnbao@gmail.com> wrote: > > > > On Tue, Jun 3, 2025 at 8:34 AM Peter Xu wrote: > > > > > > On Tue, Jun 03, 2025 at 02:14:19AM +0800, Kairui Song wrote: > > > > From: Kairui Song > > > > > > > > On seeing a swap entry PTE, userfaultfd_move does a lockless swap cache > > > > lookup, and try to move the found folio to the faulting vma when. > > > > Currently, it relies on the PTE value check to ensure the moved folio > > > > still belongs to the src swap entry, which turns out is not reliable. > > > > > > > > While working and reviewing the swap table series with Barry, following > > > > existing race is observed and reproduced [1]: > > > > > > > > ( move_pages_pte is moving src_pte to dst_pte, where src_pte is a > > > > swap entry PTE holding swap entry S1, and S1 isn't in the swap cache.) > > > > > > > > CPU1 CPU2 > > > > userfaultfd_move > > > > move_pages_pte() > > > > entry = pte_to_swp_entry(orig_src_pte); > > > > // Here it got entry = S1 > > > > ... < Somehow interrupted> ... > > > > > > > > // folio A is just a new allocated folio > > > > // and get installed into src_pte > > > > > > > > // src_pte now points to folio A, S1 > > > > // has swap count == 0, it can be freed > > > > // by folio_swap_swap or swap > > > > // allocator's reclaim. > > > > > > > > // folio B is a folio in another VMA. > > > > > > > > // S1 is freed, folio B could use it > > > > // for swap out with no problem. > > > > ... > > > > folio = filemap_get_folio(S1) > > > > // Got folio B here !!! > > > > ... < Somehow interrupted again> ... > > > > > > > > // Now S1 is free to be used again. > > > > > > > > // Now src_pte is a swap entry pte > > > > // holding S1 again. > > > > folio_trylock(folio) > > > > move_swap_pte > > > > double_pt_lock > > > > is_pte_pages_stable > > > > // Check passed because src_pte == S1 > > > > folio_move_anon_rmap(...) > > > > // Moved invalid folio B here !!! > > > > > > > > The race window is very short and requires multiple collisions of > > > > multiple rare events, so it's very unlikely to happen, but with a > > > > deliberately constructed reproducer and increased time window, it can be > > > > reproduced [1]. > > > > > > > > It's also possible that folio (A) is swapped in, and swapped out again > > > > after the filemap_get_folio lookup, in such case folio (A) may stay in > > > > swap cache so it needs to be moved too. In this case we should also try > > > > again so kernel won't miss a folio move. > > > > > > > > Fix this by checking if the folio is the valid swap cache folio after > > > > acquiring the folio lock, and checking the swap cache again after > > > > acquiring the src_pte lock. > > > > > > > > SWP_SYNCRHONIZE_IO path does make the problem more complex, but so far > > > > we don't need to worry about that since folios only might get exposed to > > > > swap cache in the swap out path, and it's covered in this patch too by > > > > checking the swap cache again after acquiring src_pte lock. > > > > > > [1] > > > > > > > > > > > Testing with a simple C program to allocate and move several GB of memory > > > > did not show any observable performance change. > > > > > > > > Cc: > > > > Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") > > > > Closes: https://lore.kernel.org/linux-mm/CAMgjq7B1K=6OOrK2OUZ0-tqCzi+EJt+2_K97TPGoSt=9+JwP7Q@mail.gmail.com/ [1] > > > > Signed-off-by: Kairui Song > > > > > > > > --- > > > > > > > > V1: https://lore.kernel.org/linux-mm/20250530201710.81365-1-ryncsn@gmail.com/ > > > > Changes: > > > > - Check swap_map instead of doing a filemap lookup after acquiring the > > > > PTE lock to minimize critical section overhead [ Barry Song, Lokesh Gidra ] > > > > > > > > V2: https://lore.kernel.org/linux-mm/20250601200108.23186-1-ryncsn@gmail.com/ > > > > Changes: > > > > - Move the folio and swap check inside move_swap_pte to avoid skipping > > > > the check and potential overhead [ Lokesh Gidra ] > > > > - Add a READ_ONCE for the swap_map read to ensure it reads a up to dated > > > > value. > > > > > > > > mm/userfaultfd.c | 23 +++++++++++++++++++++-- > > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > > > index bc473ad21202..5dc05346e360 100644 > > > > --- a/mm/userfaultfd.c > > > > +++ b/mm/userfaultfd.c > > > > @@ -1084,8 +1084,18 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, > > > > pte_t orig_dst_pte, pte_t orig_src_pte, > > > > pmd_t *dst_pmd, pmd_t dst_pmdval, > > > > spinlock_t *dst_ptl, spinlock_t *src_ptl, > > > > - struct folio *src_folio) > > > > + struct folio *src_folio, > > > > + struct swap_info_struct *si, swp_entry_t entry) > > > > { > > > > + /* > > > > + * Check if the folio still belongs to the target swap entry after > > > > + * acquiring the lock. Folio can be freed in the swap cache while > > > > + * not locked. > > > > + */ > > > > + if (src_folio && unlikely(!folio_test_swapcache(src_folio) || > > > > + entry.val != src_folio->swap.val)) > > > > + return -EAGAIN; > > > > + > > > > double_pt_lock(dst_ptl, src_ptl); > > > > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte, > > > > @@ -1102,6 +1112,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, > > > > if (src_folio) { > > > > folio_move_anon_rmap(src_folio, dst_vma); > > > > src_folio->index = linear_page_index(dst_vma, dst_addr); > > > > + } else { > > > > + /* > > > > + * Check if the swap entry is cached after acquiring the src_pte > > > > + * lock. Or we might miss a new loaded swap cache folio. > > > > + */ > > > > + if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) { > > > > > > Do we need data_race() for this, if this is an intentionally lockless read? > > > > Not entirely sure. But I recommend this pattern, borrowed from AFAIU data_race() is only for KCSAN. READ_ONCE/WRITE_ONCE will also be needed. The doc actually explicitly mentioned this case: /* * ... * be atomic *and* KCSAN should ignore the access, use both data_race() * and READ_ONCE(), for example, data_race(READ_ONCE(x)). */ #define data_race(expr) \ I'm ok if there're existing references of swap_map[], so even if it's good to have data_race() we can do that later until someone complains at all the spots.. > > zap_nonpresent_ptes() -> free_swap_and_cache_nr(), > > where the PTL is also held and READ_ONCE is used. > > > > if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { > > .. > > nr = __try_to_reclaim_swap(si, offset, > > TTRS_UNMAPPED | TTRS_FULL); > > > > if (nr == 0) > > nr = 1; > > else if (nr < 0) > > nr = -nr; > > nr = ALIGN(offset + 1, nr) - offset; > > } > > Thanks for the explanation, I also agree that holding PTL here is good > enough here. > > > > > I think we could use this to further optimize the existing > > filemap_get_folio(), since in the vast majority of cases we don't > > have a swapcache, yet we still always call filemap_get_folio(). > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > index bc473ad21202..c527ec73c3b4 100644 > > --- a/mm/userfaultfd.c > > +++ b/mm/userfaultfd.c > > > > @@ -1388,7 +1388,7 @@ static int move_pages_pte(struct mm_struct *mm, > > pmd_t *dst_pmd, pmd_t *src_pmd, > > * folios in the swapcache. This issue needs to be resolved > > * separately to allow proper handling. > > */ > > > > - if (!src_folio) > > + if (!src_folio & (swap_map[offset] & SWAP_HAS_CACHE)) > > folio = filemap_get_folio(swap_address_space(entry), > > swap_cache_index(entry)); > > if (!IS_ERR_OR_NULL(folio)) { > > > > To be future-proof, we may want to keep the READ_ONCE to ensure > > the compiler doesn't skip the second read inside move_swap_pte(). > > Maybe we can do this optimization in another patch I think. > > > > > > > > > Another pure swap question: the comment seems to imply this whole thing is > > > protected by src_pte lock, but is it? > > > > > > I'm not familiar enough with swap code, but it looks to me the folio can be > > > added into swap cache and set swap_map[] with SWAP_HAS_CACHE as long as the > > > folio is locked. It doesn't seem to be directly protected by pgtable lock. > > > > > > Perhaps you meant this: since src_pte lock is held, then it'll serialize > > > with another thread B concurrently swap-in the swap entry, but only _later_ > > > when thread B's do_swap_page() will check again on pte_same(), then it'll > > > see the src pte gone (after thread A uffdio_move happened releasing src_pte > > > lock), hence thread B will release the newly allocated swap cache folio? > > > > > > There's another trivial detail that IIUC pte_same() must fail because > > > before/after the uffdio_move the swap entry will be occupied so no way to > > > have it reused, hence src_pte, even if re-populated again after uffdio_move > > > succeeded, cannot become the orig_pte (points to the swap entry in > > > question) that thread B read, hence pte_same() must check fail. > > > > in v1 of this patch, we had some similar discussions [1][2]: > > > > [1] https://lore.kernel.org/linux-mm/CAGsJ_4wBMxQSeoTwpKoWwEGRAr=iohbYf64aYyJ55t0Z11FkwA@mail.gmail.com/ > > [2] https://lore.kernel.org/linux-mm/CAGsJ_4wM8Tph0Mbc-1Y9xNjgMPL7gqEjp=ArBuv3cJijHVXe6w@mail.gmail.com/ > > > > At the very least, [2] is possible, although the probability is extremely low. > > > > "It seems also possible for the sync zRAM device. > > > > step 1: src pte points to a swap entry S without swapcache > > step 2: we call move_swap_pte() > > step 3: someone swap-in src_pte by sync path, no swapcache; swap slot > > S is freed. > > -- for zRAM; > > step 4: someone swap-out src_pte, get the exactly same swap slot S as step 1, > > adds folio to swapcache due to swapout; > > step 5: move_swap_pte() gets ptl and finds page tables are stable > > since swap-out > > happens to have the same swap slot as step1; > > step 6: we clear src_pte, move src_pte to dst_pte; but miss to move the folio. > > > > Yep, we really need to re-check pte for swapcache after holding PTL. > > " > > > > Personally, I agree that improving the changelog or the comments > > would be more helpful. In fact, there are two bugs here, and Kairui’s > > changelog clearly describes the first one. > > Yeah, the first one is quite a long and complex race already, so I > made the description on the second issue shorter. I thought it > wouldn't be too difficult to understand given the first example. I can > add some more details. IMHO it's not about how the race happens that is hard to follow. To me, it's harder to follow on how src_pte lock can prove READ_ONCE() of swap_map[] is valid. Especially, on why it's okay to read even false negative (relies on the other thread retry properly in do_swap_page). It'll be much appreciated if you could add something into either comments or git commit message on this. Maybe a link to this specific lore discussion (or v1's) would be helpful. Thanks, -- Peter Xu