From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1955FC5B543 for ; Fri, 30 May 2025 19:24:56 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 925676B0195; Fri, 30 May 2025 15:24:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8D5D06B019E; Fri, 30 May 2025 15:24:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 79D5E6B019F; Fri, 30 May 2025 15:24:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 51AC06B0195 for ; Fri, 30 May 2025 15:24:56 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id BC9A01214DA for ; Fri, 30 May 2025 19:24:55 +0000 (UTC) X-FDA: 83500551750.08.B5685CA Received: from mail-lj1-f170.google.com (mail-lj1-f170.google.com [209.85.208.170]) by imf07.hostedemail.com (Postfix) with ESMTP id 99CF040010 for ; Fri, 30 May 2025 19:24:53 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Kx1wMY7t; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf07.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.170 as permitted sender) smtp.mailfrom=ryncsn@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748633093; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=3x+XwOs38yhaUq7Mf1YCL8Z3w+p+5XhoJf0uNSubdMw=; b=2dyjWRKEP/UReRQ3/WgYHlW2K6bQ17BSo36eapoGGLhBFQpaRuVLxAWejytDruCAwhXSUz 0TKnGFtd1p8ftPizmX1aAqtkQQqf+qq024ZcY/Gj8PsB9uiDVj0jole0+113Z7SNxNXF0w JyVMI2I5zvdpw/WgtevM5taJfu4I9k8= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748633093; a=rsa-sha256; cv=none; b=P4gO5auFmtc6DDuP9whduPz0KhY/GRrF1N2ZFGivyeHxne1Ae32JfVkGd7JmFdwUeSICmf neRWQI3NJgrY2cQ+yXMKAurDB/HVg3nOTzzM3WaA5hEqULcMPdsD16Vz1ZanXdFcmcdSUo Z34AYqwrN6RfMaUYtge4APV6U2UFDJI= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=Kx1wMY7t; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf07.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.170 as permitted sender) smtp.mailfrom=ryncsn@gmail.com Received: by mail-lj1-f170.google.com with SMTP id 38308e7fff4ca-32a696ff4dcso21319491fa.3 for ; Fri, 30 May 2025 12:24:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748633092; x=1749237892; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=3x+XwOs38yhaUq7Mf1YCL8Z3w+p+5XhoJf0uNSubdMw=; b=Kx1wMY7t8bow9NwuvO7IP4zL045/Bw9dt4ZJIMQ98nlqJTjTR3fiBCdhundgCxREEg MdW5oY5xLTLWjbB7pKdnGDEAn1K87A+f/fof554z2J5XaQnta0cL46zQparxFSt96pCN EP0bcZmxxxv8bsw+XmfhGNxviEsoE9S+Iy4edLw0YRl4vDwvvd1BTSctGutyhsXkOXQs naeQKdRAA5lCJ+E596oacIT2MBLVJov6I3QWzi3NteFykrgI8Mlav3p5Obe7FnauvY6U TAz9X6t7wa8+x0VsvGIRPbSMsWO45spmc7wGo+U7PI314+HxJchF7mIBwlQB7P6uC6e6 XrIA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748633092; x=1749237892; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=3x+XwOs38yhaUq7Mf1YCL8Z3w+p+5XhoJf0uNSubdMw=; b=GpzCXZRnOiLJqP6HRZKIkiiejlONYL/wETUMYwXwR45XyNMA7sVj5I4lf3g7tgczzL AvVRtIsUg+hx5bCCfNzEDzeqUDzo2IDtE5CO3pm/oE3fIgqkfSGrcp0kdcab9ZGA8EuJ 7ATBmn+Geq3IVF4yYzuRK4UTt78aU4+tg9iGkW2ByleXEJE1EB9Hzc5BatWtYlWlwwU+ bspBM+1f909qdZfYYeFtP4zNpeN/BzvfNfkrLwENU4BWnrbdqGWNBGUMlaKqHkPpWCkm aSH5mFC1/ceMv6FMvBJ3Zzosxqegmt+F1OiBbsCPel9OY7MJu7Ud5h8no5l28NGEN7iw pW7A== X-Forwarded-Encrypted: i=1; AJvYcCXFiXuxlEysl0TaigVnBxKRZEPeA44uBJmmLgbe4DIzQPlcShLiZ0o4YLoKwn/BA3GeUx+tX65noQ==@kvack.org X-Gm-Message-State: AOJu0YwIWvUtTHfmEaW84aeCVcpfex0Yn7V627FNyMaIVWidrpSKZB4e dJuVnRiohkpTjZMGRGBw5d6Z9rImPOE9u8xDZz0R3nNdafZ1/WePNtw25pgBBgQBSQMctDvy1nd dGbEoPsF/iLbelLA4sU9PKYfs1Z7rEw4= X-Gm-Gg: ASbGncsmAlb4dq2OGfc34lOzaR4PmkIrDhZLVb6iiuwa9JPD3ykM+cxXXAYCTKuDpdM rmeU43ftpmfsl8d2uyq1CqZPZ459iSKt8PVo20ahMWKqhG0vtjLNYYOR+dR3wfqrGI4zCgUlk+M yI+pK4dUoyh9/UM357bd2FYQ5oDSH938OENQ3cUprw1C4= X-Google-Smtp-Source: AGHT+IFZyLsLaTu5kEJeN7ag0B1DXxz1fOh2qsr6VsO0/H+k5BZLIsvQRcYDkt5MugmeGiNK2AzUsFh9j4kxz2Z2VN0= X-Received: by 2002:a2e:be08:0:b0:32a:7979:47c3 with SMTP id 38308e7fff4ca-32a8cd52862mr17933541fa.11.1748633091194; Fri, 30 May 2025 12:24:51 -0700 (PDT) MIME-Version: 1.0 References: <20250514201729.48420-6-ryncsn@gmail.com> <20250519043847.1806-1-21cnbao@gmail.com> In-Reply-To: From: Kairui Song Date: Sat, 31 May 2025 03:24:33 +0800 X-Gm-Features: AX0GCFsTtMA00jY9PQv2Ds1-E9ku6iax6UhhwZbgbdVWaUbwUe8j-aTdr59syqw Message-ID: Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention To: Barry Song <21cnbao@gmail.com> Cc: akpm@linux-foundation.org, Baolin Wang , Baoquan He , Chris Li , David Hildenbrand , Johannes Weiner , Hugh Dickins , Kalesh Singh , LKML , linux-mm , Nhat Pham , Ryan Roberts , Kemeng Shi , Tim Chen , Matthew Wilcox , "Huang, Ying" , Yosry Ahmed Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 99CF040010 X-Stat-Signature: 1ndh7suoc7zh4yqfnkd3atkzyh1i9b81 X-Rspam-User: X-HE-Tag: 1748633093-682841 X-HE-Meta: U2FsdGVkX1/eKni024IQIFJfFWLoYqeCt1+TlJLIDdsoyZWnjGy/ZhN6to4brBxJ8GobJ+fVD950ascHkynsx+AaV41LMRPd22lxtC2Zrc7Q7A8WIVkZYBTqdt4GdTPD1BCcjkFOdc/4wy5WUBUzyu93L+1nTAYX99p8Lsh1ILRkkgdvJGYr74LZe3Jd8v9oEGIOPNImTPK78ZSvnTFrZ5U3tlSY5AHbZ7D1KQTudtqLID729pWXCQNAo65LHF2ZTCeXAwe6XcoTkaOwREyfihwxKzWuz6m8g5CmBEieNwnebOxEZ+FqLVZjl63JvzDfBsdOxwkegP2/NWG3IKRVS8f55NSNP4uLu3jMkjiVp6LX0CrynJ4qVTTlttoPjLox+2mG+pLuScAdUtSKf+YQHLhw3l/7q7YdRC5seV6Uav9/TFxcYtMz9v6Ie4mHqceXc1uWC3czshXz9sxOYZ9HiovyfKcyQiM/TQLdlTCK8Jf14f+b4J1zaMtFqSL13Z+T4C4OsA98lTwTLmInnBQ6wi+A+eYUYUL1OmQvp6dXCDfxnR8moHN2J5znq5ZMBkWgA9Gfn6UEZsTwj6Ew02c5MqGg3+7bqvI07GQ+m8sm4m+4jXahPvr4I6ImBdK8CgJop2q78ExGKXigHbW01Jqa09cPviDlXrQ8b7MS7tUbQtdFnaEJ+fwtAWegSh6NZ3XtzFD+ev4sk15G8774UFlzXWFd1bi7HoNkinlZexI3SkO4kPJUxf/Es5mrhFyFqVjx60syRjI97rsAD+Aih2BfqWnxgG3qEmU8S7bLzwoO9J7g4dMlgqe42otfL3UKMTvd0TBvIr3DFHvzl9UDRJLZ2hJofpgjPgNQorHUOJ31KNjpyEjSz/5gvahjtV7i8wSyB5SsnzYUTjCtU7L+qYSrg+3Q7dMPOvyIsjoCSjLiYF+uh3ZS3ycGRJIgRAj/tij3pyPoSZbXCAipLQqO8Jn 1xSgMs2f T0VHHQrRWu8qiMRDjw5QcwD1C2ItGzv8YMoEM0e8T1sVxkpppNrhIQpdh/qI5fiUm0lgiIsVVHDH09+Z1yP8hXAmRsFkTRg/zX/qaJm1p0247Pl83Ojj8US/lzV8VRyL/06zsx7oLIaXSBtyu4isKZmY9HTZEj8reZq4sLu2+g4QaVBvQOlKFepLa+mI20/PltcOPRgDgKEtbBvS8S0SmPS0fr7i0J8plspN4hc9ACeRn+Oj2urH8/4gvTrVyW/5/ZapWVvgQgTp/+yiiOYI8s5h9hhztNeLkXCVKOecm+YiWtNwsP64uSUoO7B28lgkl3rWYPwZu0sa580QUvEw5CqERLDKv5BhhddcpovgHLET6Xvp35J2mWUj4ZtCUFxgvlktmM9bS2lemAx6apKfQuuOxwg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, May 30, 2025 at 4:49=E2=80=AFPM Kairui Song wrot= e: > > On Tue, May 27, 2025 at 11:11=E2=80=AFPM Kairui Song w= rote: > > > > On Tue, May 27, 2025 at 3:59=E2=80=AFPM Barry Song <21cnbao@gmail.com> = wrote: > > > > > > On Sat, May 24, 2025 at 8:01=E2=80=AFAM Kairui Song wrote: > > > > > > > > On Fri, May 23, 2025 at 10:30=E2=80=AFAM Barry Song <21cnbao@gmail.= com> wrote: > > > > > > > > > > On Wed, May 21, 2025 at 2:45=E2=80=AFPM Kairui Song wrote: > > > > > > > > > > > > Barry Song <21cnbao@gmail.com> =E4=BA=8E 2025=E5=B9=B45=E6=9C= =8821=E6=97=A5=E5=91=A8=E4=B8=89 06:33=E5=86=99=E9=81=93=EF=BC=9A > > > > > > > Let me run test case [1] to check whether this ever happens. = I guess I need to > > > > > > > hack kernel a bit to always add folio to swapcache even for S= YNC IO. > > > > > > > > > > > > That will cause quite a performance regression I think. Good th= ing is, > > > > > > that's exactly the problem this series is solving by dropping t= he SYNC > > > > > > IO swapin path and never bypassing the swap cache, while improv= ing the > > > > > > performance, eliminating things like this. One more reason to j= ustify > > > > > > the approach :) > > > > > > > > Hi Barry, > > > > > > > > > > > > > > I attempted to reproduce the scenario where a folio is added to t= he swapcache > > > > > after filemap_get_folio() returns NULL but before move_swap_pte() > > > > > moves the swap PTE > > > > > using non-synchronized I/O. Technically, this seems possible; how= ever, > > > > > I was unable > > > > > to reproduce it, likely because the time window between swapin_re= adahead and > > > > > taking the page table lock within do_swap_page() is too short. > > > > > > > > Thank you so much for trying this! > > > > > > > > I have been trying to reproduce it too, and so far I didn't observe > > > > any crash or warn. I added following debug code: > > > > > > > > static __always_inline > > > > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned lon= g dst_end) > > > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *m= m, > > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > > pmd_t dummy_pmdval; > > > > pmd_t dst_pmdval; > > > > struct folio *src_folio =3D NULL; > > > > + struct folio *tmp_folio =3D NULL; > > > > struct anon_vma *src_anon_vma =3D NULL; > > > > struct mmu_notifier_range range; > > > > int err =3D 0; > > > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *= mm, > > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > > if (!src_folio) > > > > folio =3D filemap_get_folio(swap_address_sp= ace(entry), > > > > swap_cache_index(entry)); > > > > + udelay(get_random_u32_below(1000)); > > > > + tmp_folio =3D filemap_get_folio(swap_address_space(= entry), > > > > + swap_cache_index(entry)); > > > > + if (!IS_ERR_OR_NULL(tmp_folio)) { > > > > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != =3D folio) { > > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio > > > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio, > > > > tmp_folio->swap.val); > > > > + } > > > > + folio_put(tmp_folio); > > > > + } > > > > if (!IS_ERR_OR_NULL(folio)) { > > > > if (folio_test_large(folio)) { > > > > err =3D -EBUSY; > > > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *m= m, > > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > > err =3D move_swap_pte(mm, dst_vma, dst_addr, src_ad= dr, > > > > dst_pte, src_pte, > > > > orig_dst_pte, orig_src_pte, dst_pmd= , dst_pmdval, > > > > dst_ptl, src_ptl, src_folio); > > > > + if (tmp_folio !=3D folio && !err) > > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed > > > > check: %lx -> %lx\n", folio, tmp_folio); > > > > } > > > > > > > > And I saw these two prints are getting triggered like this (not a r= eal > > > > issue though, just help to understand the problem) > > > > ... > > > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) -> > > > > fffffdffc7ccac80 (51) > > > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) -> > > > > fffffdffc3435e00 (3b) > > > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) -> > > > > fffffdffc34ab8c0 (76) > > > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) -> > > > > fffffdffc34ab8c0 (6d) > > > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) -> > > > > fffffdffc34eda80 (32) > > > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) -> > > > > fffffdffc38cedc0 (b) > > > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffff= dffc34db140 > > > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) -> > > > > fffffdffc3465cc0 (21) > > > > ... > > > > > > > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) -> > > > > fffffdffc3853540 (53)" worries me at first. On first look it seems = the > > > > folio is indeed freed completely from the swap cache after the firs= t > > > > lookup, so another swapout can reuse the entry. But as you mentione= d > > > > __remove_mapping won't release a folio if the refcount check fails,= so > > > > they must be freed by folio_free_swap or __try_to_reclaim_swap, the= re > > > > are many places that can happen. But these two helpers won't free a > > > > folio from swap cache if its swap count is not zero. And the folio > > > > will either be swapped out (swap count non zero), or mapped (freein= g > > > > it is fine, PTE is non_swap, and another swapout will still use the > > > > same folio). > > > > > > > > So after more investigation and dumping the pages, it's actually th= e > > > > second lookup (tmp_folio) seeing the entry being reused by another > > > > page table entry, after the first folio is swapped back and release= d. > > > > So the page table check below will always fail just fine. > > > > > > > > But this also proves the first look up can see a completely irrelev= ant > > > > folio too: If the src folio is swapped out, but got swapped back an= d > > > > freed, then another folio B shortly got added to swap cache reuse t= he > > > > src folio's old swap entry, then the folio B got seen by the look u= p > > > > here and get freed from swap cache, then src folio got swapped out > > > > again also reusing the same entry, then we have a problem as PTE se= ems > > > > untouched indeed but we grabbed a wrong folio. Seems possible if I'= m > > > > not wrong: > > > > > > > > Something like this: > > > > CPU1 CPU2 > > > > move_pages_pte > > > > entry =3D pte_to_swp_entry(orig_src_pte); > > > > | Got Swap Entry S1 from src_pte > > > > ... > > > > > > > > > > I=E2=80=99m assuming you mean ``, sinc= e I=E2=80=99m not > > > sure where folio B comes from in the statement ` > > swap out folio B>`. > > > > > > If that assumption is correct, and folio A is still in the swapcache, > > > how could someone swap in folio B without hitting folio A? That would > > > suggest folio A must have been removed from the swapcache earlier=E2= =80=94right? > > > > > > > > > > > > > > > Sorry my bad, I think I made people think folio B is related to > > src_pte at this point. What I actually mean is that: Another random > > folio B, unrelated to src_pte, could got swapped out, and using the > > swap entry S1. > > > > > > > > > > ... > > > > folio =3D swap_cache_get_folio(S1) > > > > | Got folio B here !!! > > > > move_swap_pte > > > > > > > > | Holding a reference doesn't pi= n the cache > > > > | as we have demonstrated > > > > > > > > double_pt_lock > > > > is_pte_pages_stable > > > > | Passed because of S1 is reused > > > > folio_move_anon_rmap(...) > > > > | Moved invalid folio B here !!! > > > > > > > > But this is extremely hard to reproduce though, even if doing it > > > > deliberately... > > > > > > > > So I think a "folio_swap_contains" or equivalent check here is a go= od > > > > thing to have, to make it more robust and easier to understand. The > > > > checking after locking a folio has very tiny overhead and can > > > > definitely ensure the folio's swap entry is valid and stable. > > > > > > > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb0= 0" > > > > here might seem problematic, but it's still not a real problem. Tha= t's > > > > the case where the swapin in src region happens after the lookup, a= nd > > > > before the PTE lock. It will pass the PTE check without moving the > > > > folio. But the folio is guaranteed to be a completely new folio her= e > > > > because the folio can't be added back to the page table without > > > > holding the PTE lock, and if that happens the following PTE check h= ere > > > > will fail. > > > > > > > > So I think we should patch the current kernel only adding a > > > > "folio_swap_contains" equivalent check here, and maybe more comment= s, > > > > how do you think? > > > > > > The description appears to have some inconsistencies. > > > Would you mind rephrasing it? > > > > Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> > > fffffdffc385fb00" part first, as both you and me have come into a > > conclusion that "filemap_get_folio() returns NULL before > > move_swap_pte, but a folio was added to swap cache" is OK, and this > > output only proves that happens. > > > > So the problematic race is: > > > > Here move_pages_pte is moving src_pte to dst_pte, and it begins with > > src_pte =3D=3D swap entry S1, and S1 isn't cached. > > > > CPU1 CPU2 > > move_pages_pte() > > entry =3D pte_to_swp_entry(orig_src_pte); > > | src_pte is absent, and got entry =3D=3D S1 > > ... < Somehow interrupted> ... > > > > | folio A is just a new allocated fo= lio > > | for resolving the swap in fault. > > > > | swap in fault is resolved, src_pte > > | now points to folio A, so folio A > > | can get freed just fine. > > | And now S1 is free to be used > > | by anyone. > > > > | Folio B is a completely unrelated > > | folio swapped out by random proces= s. > > | (has nothing to do with src_pte) > > | But S1 is freed so it may use S1 > > | as its swap entry. > > > > ... > > folio =3D filemap_get_folio(S1) > > | The lookup is using S1, so it > > | got folio B here !!! > > ... < Somehow interrupted> ... > > > > | Folio B could fail to be swapped o= ut > > | or got swapped in again, so it can > > | be freed by folio_free_swap or > > | swap cache reclaim. > > | CPU1 is holding a reference but it > > | doesn't pin the swap cache folio > > | as I have demonstrated with the > > | test C program previously. > > | New S1 is free to be used again. > > > > | No thing blocks this from happenin= g > > | The swapout is still using folio A= , > > | and src_pte =3D=3D S1. > > folio_trylock(folio) > > move_swap_pte > > double_pt_lock > > is_pte_pages_stable > > | Passed because of S1 is reused so src_pte =3D=3D S1. > > folio_move_anon_rmap(...) > > | Moved invalid folio B here !!! > > > > It's a long and complex one, I don't think it's practically possible > > to happen in reality but in theory doable, once in a million maybe... > > Still we have to fix it, or did I got anything wrong here? > > Hi Barry, > > I managed to reproduce this issue, by hacking the kernel a bit (Only > adding only delay to increase the race window, and adding a WARN to > indicate the problem) > > 1. Applying following patch for kernel: > =3D=3D=3D > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index bc473ad21202..1d710adf9839 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > #include > #include > #include "internal.h" > @@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm, > struct vm_area_struct *dst_vma, > * occur and hit the swapcache after moving the PTE. > */ > if (src_folio) { > + if (WARN_ON(src_folio->swap.val !=3D > pte_to_swp_entry(orig_src_pte).val)) > + pr_err("Moving folio %lx (folio->swap =3D %lx), > orig_src_pte =3D %lx\n", > + (unsigned long)src_folio, src_folio->swap.= val, > + pte_to_swp_entry(orig_src_pte).val); > folio_move_anon_rmap(src_folio, dst_vma); > src_folio->index =3D linear_page_index(dst_vma, dst_addr)= ; > } > @@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm, > pmd_t *dst_pmd, pmd_t *src_pmd, > * folios in the swapcache. This issue needs to be resolv= ed > * separately to allow proper handling. > */ > + pr_err("DEBUG: Will do the lookup using entry %lx, > wait 3s...\n", entry.val); > + mdelay(1000 * 3); > if (!src_folio) > folio =3D filemap_get_folio(swap_address_space(en= try), > swap_cache_index(entry)); > + pr_err("DEBUG: Got folio value %lx, wait 3s...\n", > (unsigned long)folio); > + mdelay(1000 * 3); > if (!IS_ERR_OR_NULL(folio)) { > if (folio_test_large(folio)) { > err =3D -EBUSY; > > 2. Save following program in userspace (didn't bother with error check > for simpler code): > =3D=3D=3D > #define _GNU_SOURCE > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > #include > > #define PAGE_SIZE 4096 > /* Need to consume all slots so define the swap device size here */ > #define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1) > > char *src, *race, *dst, *place_holder; > int uffd; > > void read_in(char *p) { > /* This test program initials memory with 0xAA to bypass zeromap */ > while (*((volatile char*)p) !=3D 0xAA); > } > > void *reader_thread(void *arg) { > /* Test requires kernel to wait upon uffd move */ > read_in(dst); > return NULL; > } > > void *fault_handler_thread(void *arg) { > int ret; > struct uffd_msg msg; > struct uffdio_move move; > struct pollfd pollfd =3D { .fd =3D uffd, .events =3D POLLIN }; > pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL); > pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL); > > while (1) { > poll(&pollfd, 1, -1); > read(uffd, &msg, sizeof(msg)); > > move.src =3D (unsigned long)src + (msg.arg.pagefault.address - > (unsigned long)dst); > move.dst =3D msg.arg.pagefault.address & ~(PAGE_SIZE - 1); > move.len =3D PAGE_SIZE; > move.mode =3D 0; > > ioctl(uffd, UFFDIO_MOVE, &move); > } > return NULL; > } > > int main() { > pthread_t fault_handler_thr, reader_thr; > struct uffdio_api uffdio_api =3D { .api =3D UFFD_API, .features =3D 0= }; > struct uffdio_register uffdio_register; > > src =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | > MAP_ANONYMOUS, -1, 0); > dst =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | > MAP_ANONYMOUS, -1, 0); > memset(src, 0xAA, PAGE_SIZE); > madvise(src, PAGE_SIZE, MADV_PAGEOUT); > > /* Consume all slots on swap device left only one entry (S1) */ > place_holder =3D mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ | > PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1); > madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT); > > /* Setup uffd handler and dst reader */ > uffd =3D syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK); > ioctl(uffd, UFFDIO_API, &uffdio_api); > uffdio_register.range.start =3D (unsigned long)dst; > uffdio_register.range.len =3D PAGE_SIZE; > uffdio_register.mode =3D UFFDIO_REGISTER_MODE_MISSING; > ioctl(uffd, UFFDIO_REGISTER, &uffdio_register); > pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL); > pthread_create(&reader_thr, NULL, reader_thread, NULL); > > /* Wait for UFFDIO to start */ > sleep(1); > > /* Release src folio (A) from swap, freeing the entry S1 */ > read_in(src); > > /* Swapout another race folio (B) using S1 */ > race =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | > MAP_ANONYMOUS, -1, 0); > memset(race, 0xAA, PAGE_SIZE); > madvise(race, PAGE_SIZE, MADV_PAGEOUT); > > /* Wait for UFFDIO swap lookup to see the race folio (B) */ > sleep(3); > > /* Free the race folio (B) from swap */ > read_in(race); > /* And swap out src folio (A) again, using S1 */ > madvise(src, PAGE_SIZE, MADV_PAGEOUT); > > /* Kernel should have moved a wrong folio by now */ > > pthread_join(reader_thr, NULL); > pthread_cancel(fault_handler_thr); > pthread_join(fault_handler_thr, NULL); > munmap(race, PAGE_SIZE); > munmap(src, PAGE_SIZE); > munmap(dst, PAGE_SIZE); > close(uffd); > > return 0; > } > > 3. Run the test with (ensure no other swap device is mounted and > current dir is on a block device): > =3D=3D=3D > dd if=3D/dev/zero of=3Dswap.img bs=3D1M count=3D1; mkswap swap.img; swapo= n > swap.img; gcc test-uffd.c && ./a.out > > Then we get the WARN: > [ 348.200587] ------------[ cut here ]------------ > [ 348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104 > move_pages_pte+0xdb8/0x11a0 > [ 348.207544] Modules linked in: loop > [ 348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not > tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304 > PREEMPT(voluntary) > [ 348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS > edk2-stable202408-prebuilt.qemu.org 08/13/2024 > [ 348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYP= E=3D--) > [ 348.222013] pc : move_pages_pte+0xdb8/0x11a0 > [ 348.224062] lr : move_pages_pte+0x928/0x11a0 > [ 348.225881] sp : ffff800088b2b8f0 > [ 348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffb= c920000 > [ 348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc= 3e4dd40 > [ 348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c= 5a6d2c0 > [ 348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 000000000= 0000006 > [ 348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 000000000= 0000003 > [ 348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 000000000= 0000003 > [ 348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 000000000= 0000001 > [ 348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a475= 5424544 > [ 348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 000000000= 0000000 > [ 348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc= 335e4a8 > [ 348.258134] Call trace: > [ 348.259468] move_pages_pte+0xdb8/0x11a0 (P) > [ 348.261348] move_pages+0x3c0/0x738 > [ 348.262987] userfaultfd_ioctl+0x3d8/0x1f98 > [ 348.264916] __arm64_sys_ioctl+0x88/0xd0 > [ 348.266779] invoke_syscall+0x64/0xec > [ 348.268347] el0_svc_common+0xa8/0xd8 > [ 348.269967] do_el0_svc+0x1c/0x28 > [ 348.271711] el0_svc+0x40/0xe0 > [ 348.273345] el0t_64_sync_handler+0x78/0x108 > [ 348.274821] el0t_64_sync+0x19c/0x1a0 > [ 348.276117] ---[ end trace 0000000000000000 ]--- > [ 348.278638] Moving folio fffffdffc3e4dd40 (folio->swap =3D 0), orig_sr= c_pte =3D 1 > > That's the new added WARN, but the test program also hung with D > forever, and with errors with other tests like: > [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 > type:MM_ANONPAGES val:-1 > [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 > type:MM_SHMEMPAGES val:1 > > Because the kernel just moved the wrong folio, so unmap takes forever > looking for the missing folio, and counting went wrong too. > > So this race is real. It's extremely unlikely to happen because it > requires multiple collisions of multiple tiny race windows, however > it's not impossible. > > I'll post a fix very soon. On second thought, the "filemap_get_folio() returns NULL before move_swap_pte, but a folio was added to swap cache" case is also buggy. It can also be reproduced with the program above with slight modification: --- test-uffd.c 2025-05-30 08:34:00.485206529 +0000 +++ test-uffd-same-folio.c 2025-05-30 19:04:13.826078271 +0000 @@ -83,20 +83,20 @@ /* Release src folio (A) from swap, freeing the entry S1 */ read_in(src); - /* Swapout another race folio (B) using S1 */ + /* Swapout and free another race folio (B) forcing reclaiming S1 and folio (A) */ race =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); memset(race, 0xAA, PAGE_SIZE); madvise(race, PAGE_SIZE, MADV_PAGEOUT); + read_in(race); + printf("RECLAMING A?\n"); - /* Wait for UFFDIO swap lookup to see the race folio (B) */ + /* Wait for UFFDIO swap lookup to see NULL */ sleep(3); - /* Free the race folio (B) from swap */ - read_in(race); /* And swap out src folio (A) again, using S1 */ madvise(src, PAGE_SIZE, MADV_PAGEOUT); - /* Kernel should have moved a wrong folio by now */ + /* Kernel should have moved folio (A) but it didn't */ pthread_join(reader_thr, NULL); pthread_cancel(fault_handler_thr); I'll fix them together.